Message ID | 20160628134723.7ffefcf3@t450s.home (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, 28 Jun 2016 13:47:23 -0600 Alex Williamson <alex.williamson@redhat.com> wrote: > On Tue, 28 Jun 2016 18:09:46 +0800 > Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: > > > Hi, Alex > > > > On 2016/6/25 0:43, Alex Williamson wrote: > > > > > On Fri, 24 Jun 2016 23:37:02 +0800 > > > Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: > > > > > >> Hi, Alex > > >> > > >> On 2016/6/24 11:37, Alex Williamson wrote: > > >> > > >>> On Fri, 24 Jun 2016 10:52:58 +0800 > > >>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: > > >>>> On 2016/6/24 0:12, Alex Williamson wrote: > > >>>>> On Mon, 30 May 2016 21:06:37 +0800 > > >>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: > > >>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > > >>>>>> +{ > > >>>>>> + struct resource *res; > > >>>>>> + int bar; > > >>>>>> + struct vfio_pci_dummy_resource *dummy_res; > > >>>>>> + > > >>>>>> + INIT_LIST_HEAD(&vdev->dummy_resources_list); > > >>>>>> + > > >>>>>> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) { > > >>>>>> + res = vdev->pdev->resource + bar; > > >>>>>> + > > >>>>>> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP)) > > >>>>>> + goto no_mmap; > > >>>>>> + > > >>>>>> + if (!(res->flags & IORESOURCE_MEM)) > > >>>>>> + goto no_mmap; > > >>>>>> + > > >>>>>> + /* > > >>>>>> + * The PCI core shouldn't set up a resource with a > > >>>>>> + * type but zero size. But there may be bugs that > > >>>>>> + * cause us to do that. > > >>>>>> + */ > > >>>>>> + if (!resource_size(res)) > > >>>>>> + goto no_mmap; > > >>>>>> + > > >>>>>> + if (resource_size(res) >= PAGE_SIZE) { > > >>>>>> + vdev->bar_mmap_supported[bar] = true; > > >>>>>> + continue; > > >>>>>> + } > > >>>>>> + > > >>>>>> + if (!(res->start & ~PAGE_MASK)) { > > >>>>>> + /* > > >>>>>> + * Add a dummy resource to reserve the remainder > > >>>>>> + * of the exclusive page in case that hot-add > > >>>>>> + * device's bar is assigned into it. > > >>>>>> + */ > > >>>>>> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL); > > >>>>>> + if (dummy_res == NULL) > > >>>>>> + goto no_mmap; > > >>>>>> + > > >>>>>> + dummy_res->resource.start = res->end + 1; > > >>>>>> + dummy_res->resource.end = res->start + PAGE_SIZE - 1; > > >>>>>> + dummy_res->resource.flags = res->flags; > > >>>>>> + if (request_resource(res->parent, > > >>>>>> + &dummy_res->resource)) { > > >>>>>> + kfree(dummy_res); > > >>>>>> + goto no_mmap; > > >>>>>> + } > > >>>>> Isn't it true that request_resource() only tells us that at a given > > >>>>> point in time, no other drivers have reserved that resource? It seems > > >>>>> like it does not guarantee that the resource isn't routed to another > > >>>>> device or that another driver won't at some point attempt to request > > >>>>> that same resource. So for example if a user constructs their initrd > > >>>>> to bind vfio-pci to devices before other modules load, this > > >>>>> request_resource() may succeed, at the expense of drivers loaded later > > >>>>> now failing. The behavior will depend on driver load order and we're > > >>>>> not actually insuring that the overflow resource is unused, just that > > >>>>> we got it first. Can we do better? Am I missing something that > > >>>>> prevents this? Thanks, > > >>>>> > > >>>>> Alex > > >>>> Couldn't PCI resources allocator prevent this, which will find a > > >>>> empty slot in the resource tree firstly, then try to request that > > >>>> resource in allocate_resource() when a PCI device is probed. > > >>>> And I'd like to know why a PCI device driver would attempt to > > >>>> call request_resource()? Should this be done in PCI enumeration? > > >>> Hi Yongji, > > >>> > > >>> Looks like most pci drivers call pci_request_regions(). From there the > > >>> call path is: > > >>> > > >>> pci_request_selected_regions > > >>> __pci_request_selected_regions > > >>> __pci_request_region > > >>> __request_mem_region > > >>> __request_region > > >>> __request_resource > > >>> > > >>> We see this driver ordering issue sometimes with users attempting to > > >>> blacklist native pci drivers, trying to leave a device free for use by > > >>> vfio-pci. If the device is a graphics card, the generic vesa or uefi > > >>> driver can request device resources causing a failure when vfio-pci > > >>> tries to request those same resources. I expect that unless it's a > > >>> boot device, like vga in my example, the resources are not enabled > > >>> until the driver opens the device, therefore the request_resource() call > > >>> doesn't occur until that point. > > >>> > > >>> For another trivial example, look at /proc/iomem as you load and unload > > >>> a driver, on my laptop with e1000e unloaded I see: > > >>> > > >>> e1200000-e121ffff : 0000:00:19.0 > > >>> e123e000-e123efff : 0000:00:19.0 > > >>> > > >>> When e1000e is loaded, each of these becomes claimed by the e1000e > > >>> driver: > > >>> > > >>> e1200000-e121ffff : 0000:00:19.0 > > >>> e1200000-e121ffff : e1000e > > >>> e123e000-e123efff : 0000:00:19.0 > > >>> e123e000-e123efff : e1000e > > >>> > > >>> Clearly pci core knows the resource is associated with the device, but > > >>> I don't think we're tapping into that with request_resource(), we're > > >>> just potentially stealing resources that another driver might have > > >>> claimed otherwise as I described above. That's my suspicion at > > >>> least, feel free to show otherwise if it's incorrect. Thanks, > > >>> > > >>> Alex > > >>> > > >> Thanks for your explanation. But I still have one question. > > >> Shouldn't PCI core have claimed all PCI device's resources > > >> after probing those devices. If so, request_resource() will fail > > >> when vfio-pci try to steal resources that another driver might > > >> request later. Anything I missed here? Some device resources > > >> would not be claimed in PCI core? > > > Hi Yongji, > > > > > > I don't know what to say, this is not how the interface currently > > > works. request_resource() is a driver level interface that tries to > > > prevent drivers from claiming conflicting resources. In this patch > > > you're trying to use it to probe whether a resource maps to another > > > device. This is not what it does. request_resource() will happily let > > > you claim any resource you want, so long as nobody else claimed it > > > first. So the only case where the assumptions in this patch are valid > > > is if we can guarantee that any potentially conflicting device has a > > > driver loaded that has claimed those resources. As it is here, > > > vfio-pci will happily attempt to request resources that might overlap > > > with another device and might break drivers that haven't yet had a > > > chance to probe their devices. I don't think that's acceptable. > > > Thanks, > > > > > > Alex > > > > > > > I'm trying to get your point. Let me give an example here. > > I'm not sure whether my understanding is right. Please > > point it out if I'm wrong. > > > > We assume that there are two PCI devices like this: > > > > 240000000000-24feffffffff : /pciex@3fffe40400000 > > 240000000000-2400ffffffff : PCI Bus 0002:01 > > 240000000000-240000007fff : 0002:01:00.0 > > 240000000000-240000007fff : vfio-pci > > 240000008000-24000000ffff : 0002:01:01.0 > > 240000008000-24000000ffff : lpfc > > > > Do you mean vfio-pci driver will succeed in requesting > > dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K) > > if it is loaded before lpfc driver? Like this: > > > > 240000000000-24feffffffff : /pciex@3fffe40400000 > > 240000000000-2400ffffffff : PCI Bus 0002:01 > > 240000000000-240000007fff : 0002:01:00.0 > > 240000000000-240000007fff : vfio-pci > > 240000008000-24000000ffff : 0002:01:01.0 > > 240000008000-24000000ffff : <BAD> --> vfio-pci call > > request_resource() > > > > Then lpfc driver will fail when it attempts to call > > pci_request_regions() later. > > Yes, that is my supposition. > > > But is it possible that the dummy_res become the child of > > the res: 0002:01:01.0? Wouldn't request_resource() fail when > > it found parent res: PCI Bus 0002:01 already have conflict > > child res: 0002:01:01.0. > > > > And I think the case that request_resource() will succeed > > should like this: > > > > 240000000000-24feffffffff : /pciex@3fffe40400000 > > 240000000000-2400ffffffff : PCI Bus 0002:01 > > 240000000000-240000007fff : 0002:01:00.0 > > 240000010000-240000017fff : 0002:01:01.0 > > > > There is a mem hole: [240000008000-24000000ffff] after > > PCI probing. After loading drivers, the resources tree > > will be: > > > > 240000000000-24feffffffff : /pciex@3fffe40400000 > > 240000000000-2400ffffffff : PCI Bus 0002:01 > > 240000000000-240000007fff : 0002:01:00.0 > > 240000000000-240000007fff : vfio-pci > > 240000008000-24000000ffff : <BAD> ---> vfio-pci call > > request_resource() > > 240000010000-240000017fff : 0002:01:01.0 > > 240000010000-240000017fff : lpfc > > Ok, let's stop guessing about this. I modified your patch as follows > so I could easily test it on a 4k host: > > --- a/drivers/vfio/pci/vfio_pci.c > +++ b/drivers/vfio/pci/vfio_pci.c > @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) > return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; > } > > +#define VFIO_64K_PAGE_SIZE (64*1024) > +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1)) > + > static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > { > struct resource *res; > @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > if (!resource_size(res)) > goto no_mmap; > > - if (resource_size(res) >= PAGE_SIZE) { > + if (resource_size(res) >= VFIO_64K_PAGE_SIZE) { > vdev->bar_mmap_supported[bar] = true; > continue; > } > > - if (!(res->start & ~PAGE_MASK)) { > + if (!(res->start & ~VFIO_64K_PAGE_MASK)) { > + int ret; > /* > * Add a dummy resource to reserve the remainder > * of the exclusive page in case that hot-add > @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > goto no_mmap; > > dummy_res->resource.start = res->end + 1; > - dummy_res->resource.end = res->start + PAGE_SIZE - 1; > + dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1; > dummy_res->resource.flags = res->flags; > - if (request_resource(res->parent, > - &dummy_res->resource)) { > + ret = request_resource(res->parent, > + &dummy_res->resource); > + if (ret) { > +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret); > kfree(dummy_res); > goto no_mmap; > } > > IOW, enforce 64k for mmap regardless of PAGE_SIZE. Then I find a > system configuration to test it: > > ee400000-ef4fffff : PCI Bus 0000:07 > ef480000-ef49ffff : 0000:07:00.0 > ef480000-ef483fff : 0000:08:10.0 > ef484000-ef487fff : 0000:08:10.2 > ef488000-ef48bfff : 0000:08:10.4 > ef48c000-ef48ffff : 0000:08:10.6 > ef490000-ef493fff : 0000:08:11.0 > ef494000-ef497fff : 0000:08:11.2 > ef498000-ef49bfff : 0000:08:11.4 > ef4a0000-ef4bffff : 0000:07:00.0 > ef4a0000-ef4a3fff : 0000:08:10.0 > ef4a4000-ef4a7fff : 0000:08:10.2 > ef4a8000-ef4abfff : 0000:08:10.4 > ef4ac000-ef4affff : 0000:08:10.6 > ef4b0000-ef4b3fff : 0000:08:11.0 > ef4b4000-ef4b7fff : 0000:08:11.2 > ef4b8000-ef4bbfff : 0000:08:11.4 > > This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all > the VF BAR0s are in a contiguous range and all the VF BAR3s are in a > separate contiguous range. The igbvf driver is not loaded, so the > other resources are free to be claimed, what happens? > > It looks like you're right, the request_resource() fails with: > > vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16) > vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16) > > So we get back -EBUSY which means we hit a conflict. I would have > expected that this means our res->parent that we're using for > request_resource() is only, for instance, ef480000-ef483fff (ie. the > BAR itself) thus our request for ef484000-ef48ffff exceeds the end of > the parent, but adding the parent resource range to the dev_info(), it > actually shows the range being ef480000-ef49ffff, so the parent is > actually the 07:00.0 resource. In fact, we can't even use > request_resource() like this to claim the BAR itself, which is why we > use pci_request_selected_regions(), which allows conflicts, putting the > requested resource at the leaf of the tree. > > So I guess I retract this concern about the use of request_resource(), > it does seem to behave as intended. Unless I can spot anything else or > other reviewers have comments, I'll queue this into my next branch for > v4.8. Thanks, Ok, one more test, I found that I have access to the following USB devices: 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI]) Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K] 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI]) Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K] These are nicely mapped such that vfio-pci can claim the residual space from the page, which results in the following in /proc/iomem: f7a07000-f7a073ff : 0000:00:1d.0 f7a07000-f7a073ff : vfio f7a07400-f7a07fff : <BAD> f7a08000-f7a083ff : 0000:00:1a.0 f7a08000-f7a083ff : vfio f7a08400-f7a08fff : <BAD> I should have looked more closely at your previous reply, I didn't think that "<BAD>" was literally the owner of these dummy resources. Please fix this to report something that isn't going frighten users and make small children cry. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Alex, On 2016/6/30 4:03, Alex Williamson wrote: > On Tue, 28 Jun 2016 13:47:23 -0600 > Alex Williamson <alex.williamson@redhat.com> wrote: > >> On Tue, 28 Jun 2016 18:09:46 +0800 >> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: >> >>> Hi, Alex >>> >>> On 2016/6/25 0:43, Alex Williamson wrote: >>> >>>> On Fri, 24 Jun 2016 23:37:02 +0800 >>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: >>>> >>>>> Hi, Alex >>>>> >>>>> On 2016/6/24 11:37, Alex Williamson wrote: >>>>> >>>>>> On Fri, 24 Jun 2016 10:52:58 +0800 >>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: >>>>>>> On 2016/6/24 0:12, Alex Williamson wrote: >>>>>>>> On Mon, 30 May 2016 21:06:37 +0800 >>>>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: >>>>>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) >>>>>>>>> +{ >>>>>>>>> + struct resource *res; >>>>>>>>> + int bar; >>>>>>>>> + struct vfio_pci_dummy_resource *dummy_res; >>>>>>>>> + >>>>>>>>> + INIT_LIST_HEAD(&vdev->dummy_resources_list); >>>>>>>>> + >>>>>>>>> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) { >>>>>>>>> + res = vdev->pdev->resource + bar; >>>>>>>>> + >>>>>>>>> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP)) >>>>>>>>> + goto no_mmap; >>>>>>>>> + >>>>>>>>> + if (!(res->flags & IORESOURCE_MEM)) >>>>>>>>> + goto no_mmap; >>>>>>>>> + >>>>>>>>> + /* >>>>>>>>> + * The PCI core shouldn't set up a resource with a >>>>>>>>> + * type but zero size. But there may be bugs that >>>>>>>>> + * cause us to do that. >>>>>>>>> + */ >>>>>>>>> + if (!resource_size(res)) >>>>>>>>> + goto no_mmap; >>>>>>>>> + >>>>>>>>> + if (resource_size(res) >= PAGE_SIZE) { >>>>>>>>> + vdev->bar_mmap_supported[bar] = true; >>>>>>>>> + continue; >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + if (!(res->start & ~PAGE_MASK)) { >>>>>>>>> + /* >>>>>>>>> + * Add a dummy resource to reserve the remainder >>>>>>>>> + * of the exclusive page in case that hot-add >>>>>>>>> + * device's bar is assigned into it. >>>>>>>>> + */ >>>>>>>>> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL); >>>>>>>>> + if (dummy_res == NULL) >>>>>>>>> + goto no_mmap; >>>>>>>>> + >>>>>>>>> + dummy_res->resource.start = res->end + 1; >>>>>>>>> + dummy_res->resource.end = res->start + PAGE_SIZE - 1; >>>>>>>>> + dummy_res->resource.flags = res->flags; >>>>>>>>> + if (request_resource(res->parent, >>>>>>>>> + &dummy_res->resource)) { >>>>>>>>> + kfree(dummy_res); >>>>>>>>> + goto no_mmap; >>>>>>>>> + } >>>>>>>> Isn't it true that request_resource() only tells us that at a given >>>>>>>> point in time, no other drivers have reserved that resource? It seems >>>>>>>> like it does not guarantee that the resource isn't routed to another >>>>>>>> device or that another driver won't at some point attempt to request >>>>>>>> that same resource. So for example if a user constructs their initrd >>>>>>>> to bind vfio-pci to devices before other modules load, this >>>>>>>> request_resource() may succeed, at the expense of drivers loaded later >>>>>>>> now failing. The behavior will depend on driver load order and we're >>>>>>>> not actually insuring that the overflow resource is unused, just that >>>>>>>> we got it first. Can we do better? Am I missing something that >>>>>>>> prevents this? Thanks, >>>>>>>> >>>>>>>> Alex >>>>>>> Couldn't PCI resources allocator prevent this, which will find a >>>>>>> empty slot in the resource tree firstly, then try to request that >>>>>>> resource in allocate_resource() when a PCI device is probed. >>>>>>> And I'd like to know why a PCI device driver would attempt to >>>>>>> call request_resource()? Should this be done in PCI enumeration? >>>>>> Hi Yongji, >>>>>> >>>>>> Looks like most pci drivers call pci_request_regions(). From there the >>>>>> call path is: >>>>>> >>>>>> pci_request_selected_regions >>>>>> __pci_request_selected_regions >>>>>> __pci_request_region >>>>>> __request_mem_region >>>>>> __request_region >>>>>> __request_resource >>>>>> >>>>>> We see this driver ordering issue sometimes with users attempting to >>>>>> blacklist native pci drivers, trying to leave a device free for use by >>>>>> vfio-pci. If the device is a graphics card, the generic vesa or uefi >>>>>> driver can request device resources causing a failure when vfio-pci >>>>>> tries to request those same resources. I expect that unless it's a >>>>>> boot device, like vga in my example, the resources are not enabled >>>>>> until the driver opens the device, therefore the request_resource() call >>>>>> doesn't occur until that point. >>>>>> >>>>>> For another trivial example, look at /proc/iomem as you load and unload >>>>>> a driver, on my laptop with e1000e unloaded I see: >>>>>> >>>>>> e1200000-e121ffff : 0000:00:19.0 >>>>>> e123e000-e123efff : 0000:00:19.0 >>>>>> >>>>>> When e1000e is loaded, each of these becomes claimed by the e1000e >>>>>> driver: >>>>>> >>>>>> e1200000-e121ffff : 0000:00:19.0 >>>>>> e1200000-e121ffff : e1000e >>>>>> e123e000-e123efff : 0000:00:19.0 >>>>>> e123e000-e123efff : e1000e >>>>>> >>>>>> Clearly pci core knows the resource is associated with the device, but >>>>>> I don't think we're tapping into that with request_resource(), we're >>>>>> just potentially stealing resources that another driver might have >>>>>> claimed otherwise as I described above. That's my suspicion at >>>>>> least, feel free to show otherwise if it's incorrect. Thanks, >>>>>> >>>>>> Alex >>>>>> >>>>> Thanks for your explanation. But I still have one question. >>>>> Shouldn't PCI core have claimed all PCI device's resources >>>>> after probing those devices. If so, request_resource() will fail >>>>> when vfio-pci try to steal resources that another driver might >>>>> request later. Anything I missed here? Some device resources >>>>> would not be claimed in PCI core? >>>> Hi Yongji, >>>> >>>> I don't know what to say, this is not how the interface currently >>>> works. request_resource() is a driver level interface that tries to >>>> prevent drivers from claiming conflicting resources. In this patch >>>> you're trying to use it to probe whether a resource maps to another >>>> device. This is not what it does. request_resource() will happily let >>>> you claim any resource you want, so long as nobody else claimed it >>>> first. So the only case where the assumptions in this patch are valid >>>> is if we can guarantee that any potentially conflicting device has a >>>> driver loaded that has claimed those resources. As it is here, >>>> vfio-pci will happily attempt to request resources that might overlap >>>> with another device and might break drivers that haven't yet had a >>>> chance to probe their devices. I don't think that's acceptable. >>>> Thanks, >>>> >>>> Alex >>>> >>> I'm trying to get your point. Let me give an example here. >>> I'm not sure whether my understanding is right. Please >>> point it out if I'm wrong. >>> >>> We assume that there are two PCI devices like this: >>> >>> 240000000000-24feffffffff : /pciex@3fffe40400000 >>> 240000000000-2400ffffffff : PCI Bus 0002:01 >>> 240000000000-240000007fff : 0002:01:00.0 >>> 240000000000-240000007fff : vfio-pci >>> 240000008000-24000000ffff : 0002:01:01.0 >>> 240000008000-24000000ffff : lpfc >>> >>> Do you mean vfio-pci driver will succeed in requesting >>> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K) >>> if it is loaded before lpfc driver? Like this: >>> >>> 240000000000-24feffffffff : /pciex@3fffe40400000 >>> 240000000000-2400ffffffff : PCI Bus 0002:01 >>> 240000000000-240000007fff : 0002:01:00.0 >>> 240000000000-240000007fff : vfio-pci >>> 240000008000-24000000ffff : 0002:01:01.0 >>> 240000008000-24000000ffff : <BAD> --> vfio-pci call >>> request_resource() >>> >>> Then lpfc driver will fail when it attempts to call >>> pci_request_regions() later. >> Yes, that is my supposition. >> >>> But is it possible that the dummy_res become the child of >>> the res: 0002:01:01.0? Wouldn't request_resource() fail when >>> it found parent res: PCI Bus 0002:01 already have conflict >>> child res: 0002:01:01.0. >>> >>> And I think the case that request_resource() will succeed >>> should like this: >>> >>> 240000000000-24feffffffff : /pciex@3fffe40400000 >>> 240000000000-2400ffffffff : PCI Bus 0002:01 >>> 240000000000-240000007fff : 0002:01:00.0 >>> 240000010000-240000017fff : 0002:01:01.0 >>> >>> There is a mem hole: [240000008000-24000000ffff] after >>> PCI probing. After loading drivers, the resources tree >>> will be: >>> >>> 240000000000-24feffffffff : /pciex@3fffe40400000 >>> 240000000000-2400ffffffff : PCI Bus 0002:01 >>> 240000000000-240000007fff : 0002:01:00.0 >>> 240000000000-240000007fff : vfio-pci >>> 240000008000-24000000ffff : <BAD> ---> vfio-pci call >>> request_resource() >>> 240000010000-240000017fff : 0002:01:01.0 >>> 240000010000-240000017fff : lpfc >> Ok, let's stop guessing about this. I modified your patch as follows >> so I could easily test it on a 4k host: >> >> --- a/drivers/vfio/pci/vfio_pci.c >> +++ b/drivers/vfio/pci/vfio_pci.c >> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) >> return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; >> } >> >> +#define VFIO_64K_PAGE_SIZE (64*1024) >> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1)) >> + >> static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) >> { >> struct resource *res; >> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) >> if (!resource_size(res)) >> goto no_mmap; >> >> - if (resource_size(res) >= PAGE_SIZE) { >> + if (resource_size(res) >= VFIO_64K_PAGE_SIZE) { >> vdev->bar_mmap_supported[bar] = true; >> continue; >> } >> >> - if (!(res->start & ~PAGE_MASK)) { >> + if (!(res->start & ~VFIO_64K_PAGE_MASK)) { >> + int ret; >> /* >> * Add a dummy resource to reserve the remainder >> * of the exclusive page in case that hot-add >> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) >> goto no_mmap; >> >> dummy_res->resource.start = res->end + 1; >> - dummy_res->resource.end = res->start + PAGE_SIZE - 1; >> + dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1; >> dummy_res->resource.flags = res->flags; >> - if (request_resource(res->parent, >> - &dummy_res->resource)) { >> + ret = request_resource(res->parent, >> + &dummy_res->resource); >> + if (ret) { >> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret); >> kfree(dummy_res); >> goto no_mmap; >> } >> >> IOW, enforce 64k for mmap regardless of PAGE_SIZE. Then I find a >> system configuration to test it: >> >> ee400000-ef4fffff : PCI Bus 0000:07 >> ef480000-ef49ffff : 0000:07:00.0 >> ef480000-ef483fff : 0000:08:10.0 >> ef484000-ef487fff : 0000:08:10.2 >> ef488000-ef48bfff : 0000:08:10.4 >> ef48c000-ef48ffff : 0000:08:10.6 >> ef490000-ef493fff : 0000:08:11.0 >> ef494000-ef497fff : 0000:08:11.2 >> ef498000-ef49bfff : 0000:08:11.4 >> ef4a0000-ef4bffff : 0000:07:00.0 >> ef4a0000-ef4a3fff : 0000:08:10.0 >> ef4a4000-ef4a7fff : 0000:08:10.2 >> ef4a8000-ef4abfff : 0000:08:10.4 >> ef4ac000-ef4affff : 0000:08:10.6 >> ef4b0000-ef4b3fff : 0000:08:11.0 >> ef4b4000-ef4b7fff : 0000:08:11.2 >> ef4b8000-ef4bbfff : 0000:08:11.4 >> >> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all >> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a >> separate contiguous range. The igbvf driver is not loaded, so the >> other resources are free to be claimed, what happens? >> >> It looks like you're right, the request_resource() fails with: >> >> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16) >> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16) >> >> So we get back -EBUSY which means we hit a conflict. I would have >> expected that this means our res->parent that we're using for >> request_resource() is only, for instance, ef480000-ef483fff (ie. the >> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of >> the parent, but adding the parent resource range to the dev_info(), it >> actually shows the range being ef480000-ef49ffff, so the parent is >> actually the 07:00.0 resource. In fact, we can't even use >> request_resource() like this to claim the BAR itself, which is why we >> use pci_request_selected_regions(), which allows conflicts, putting the >> requested resource at the leaf of the tree. >> >> So I guess I retract this concern about the use of request_resource(), >> it does seem to behave as intended. Unless I can spot anything else or >> other reviewers have comments, I'll queue this into my next branch for >> v4.8. Thanks, > > Ok, one more test, I found that I have access to the following USB > devices: > > 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI]) > Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K] > > 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI]) > Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K] > > These are nicely mapped such that vfio-pci can claim the residual space > from the page, which results in the following in /proc/iomem: > > f7a07000-f7a073ff : 0000:00:1d.0 > f7a07000-f7a073ff : vfio > f7a07400-f7a07fff : <BAD> > f7a08000-f7a083ff : 0000:00:1a.0 > f7a08000-f7a083ff : vfio > f7a08400-f7a08fff : <BAD> > > I should have looked more closely at your previous reply, I didn't > think that "<BAD>" was literally the owner of these dummy resources. > Please fix this to report something that isn't going frighten users > and make small children cry. Thanks, Yeah, I also noticed that:-). Now I'm trying to find a proper name. What do you think about "vfio-pci (dummy)"? Thanks, Yongji -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 30 Jun 2016 10:40:23 +0800 Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: > Hi Alex, > > On 2016/6/30 4:03, Alex Williamson wrote: > > > On Tue, 28 Jun 2016 13:47:23 -0600 > > Alex Williamson <alex.williamson@redhat.com> wrote: > > > >> On Tue, 28 Jun 2016 18:09:46 +0800 > >> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: > >> > >>> Hi, Alex > >>> > >>> On 2016/6/25 0:43, Alex Williamson wrote: > >>> > >>>> On Fri, 24 Jun 2016 23:37:02 +0800 > >>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: > >>>> > >>>>> Hi, Alex > >>>>> > >>>>> On 2016/6/24 11:37, Alex Williamson wrote: > >>>>> > >>>>>> On Fri, 24 Jun 2016 10:52:58 +0800 > >>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: > >>>>>>> On 2016/6/24 0:12, Alex Williamson wrote: > >>>>>>>> On Mon, 30 May 2016 21:06:37 +0800 > >>>>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: > >>>>>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > >>>>>>>>> +{ > >>>>>>>>> + struct resource *res; > >>>>>>>>> + int bar; > >>>>>>>>> + struct vfio_pci_dummy_resource *dummy_res; > >>>>>>>>> + > >>>>>>>>> + INIT_LIST_HEAD(&vdev->dummy_resources_list); > >>>>>>>>> + > >>>>>>>>> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) { > >>>>>>>>> + res = vdev->pdev->resource + bar; > >>>>>>>>> + > >>>>>>>>> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP)) > >>>>>>>>> + goto no_mmap; > >>>>>>>>> + > >>>>>>>>> + if (!(res->flags & IORESOURCE_MEM)) > >>>>>>>>> + goto no_mmap; > >>>>>>>>> + > >>>>>>>>> + /* > >>>>>>>>> + * The PCI core shouldn't set up a resource with a > >>>>>>>>> + * type but zero size. But there may be bugs that > >>>>>>>>> + * cause us to do that. > >>>>>>>>> + */ > >>>>>>>>> + if (!resource_size(res)) > >>>>>>>>> + goto no_mmap; > >>>>>>>>> + > >>>>>>>>> + if (resource_size(res) >= PAGE_SIZE) { > >>>>>>>>> + vdev->bar_mmap_supported[bar] = true; > >>>>>>>>> + continue; > >>>>>>>>> + } > >>>>>>>>> + > >>>>>>>>> + if (!(res->start & ~PAGE_MASK)) { > >>>>>>>>> + /* > >>>>>>>>> + * Add a dummy resource to reserve the remainder > >>>>>>>>> + * of the exclusive page in case that hot-add > >>>>>>>>> + * device's bar is assigned into it. > >>>>>>>>> + */ > >>>>>>>>> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL); > >>>>>>>>> + if (dummy_res == NULL) > >>>>>>>>> + goto no_mmap; > >>>>>>>>> + > >>>>>>>>> + dummy_res->resource.start = res->end + 1; > >>>>>>>>> + dummy_res->resource.end = res->start + PAGE_SIZE - 1; > >>>>>>>>> + dummy_res->resource.flags = res->flags; > >>>>>>>>> + if (request_resource(res->parent, > >>>>>>>>> + &dummy_res->resource)) { > >>>>>>>>> + kfree(dummy_res); > >>>>>>>>> + goto no_mmap; > >>>>>>>>> + } > >>>>>>>> Isn't it true that request_resource() only tells us that at a given > >>>>>>>> point in time, no other drivers have reserved that resource? It seems > >>>>>>>> like it does not guarantee that the resource isn't routed to another > >>>>>>>> device or that another driver won't at some point attempt to request > >>>>>>>> that same resource. So for example if a user constructs their initrd > >>>>>>>> to bind vfio-pci to devices before other modules load, this > >>>>>>>> request_resource() may succeed, at the expense of drivers loaded later > >>>>>>>> now failing. The behavior will depend on driver load order and we're > >>>>>>>> not actually insuring that the overflow resource is unused, just that > >>>>>>>> we got it first. Can we do better? Am I missing something that > >>>>>>>> prevents this? Thanks, > >>>>>>>> > >>>>>>>> Alex > >>>>>>> Couldn't PCI resources allocator prevent this, which will find a > >>>>>>> empty slot in the resource tree firstly, then try to request that > >>>>>>> resource in allocate_resource() when a PCI device is probed. > >>>>>>> And I'd like to know why a PCI device driver would attempt to > >>>>>>> call request_resource()? Should this be done in PCI enumeration? > >>>>>> Hi Yongji, > >>>>>> > >>>>>> Looks like most pci drivers call pci_request_regions(). From there the > >>>>>> call path is: > >>>>>> > >>>>>> pci_request_selected_regions > >>>>>> __pci_request_selected_regions > >>>>>> __pci_request_region > >>>>>> __request_mem_region > >>>>>> __request_region > >>>>>> __request_resource > >>>>>> > >>>>>> We see this driver ordering issue sometimes with users attempting to > >>>>>> blacklist native pci drivers, trying to leave a device free for use by > >>>>>> vfio-pci. If the device is a graphics card, the generic vesa or uefi > >>>>>> driver can request device resources causing a failure when vfio-pci > >>>>>> tries to request those same resources. I expect that unless it's a > >>>>>> boot device, like vga in my example, the resources are not enabled > >>>>>> until the driver opens the device, therefore the request_resource() call > >>>>>> doesn't occur until that point. > >>>>>> > >>>>>> For another trivial example, look at /proc/iomem as you load and unload > >>>>>> a driver, on my laptop with e1000e unloaded I see: > >>>>>> > >>>>>> e1200000-e121ffff : 0000:00:19.0 > >>>>>> e123e000-e123efff : 0000:00:19.0 > >>>>>> > >>>>>> When e1000e is loaded, each of these becomes claimed by the e1000e > >>>>>> driver: > >>>>>> > >>>>>> e1200000-e121ffff : 0000:00:19.0 > >>>>>> e1200000-e121ffff : e1000e > >>>>>> e123e000-e123efff : 0000:00:19.0 > >>>>>> e123e000-e123efff : e1000e > >>>>>> > >>>>>> Clearly pci core knows the resource is associated with the device, but > >>>>>> I don't think we're tapping into that with request_resource(), we're > >>>>>> just potentially stealing resources that another driver might have > >>>>>> claimed otherwise as I described above. That's my suspicion at > >>>>>> least, feel free to show otherwise if it's incorrect. Thanks, > >>>>>> > >>>>>> Alex > >>>>>> > >>>>> Thanks for your explanation. But I still have one question. > >>>>> Shouldn't PCI core have claimed all PCI device's resources > >>>>> after probing those devices. If so, request_resource() will fail > >>>>> when vfio-pci try to steal resources that another driver might > >>>>> request later. Anything I missed here? Some device resources > >>>>> would not be claimed in PCI core? > >>>> Hi Yongji, > >>>> > >>>> I don't know what to say, this is not how the interface currently > >>>> works. request_resource() is a driver level interface that tries to > >>>> prevent drivers from claiming conflicting resources. In this patch > >>>> you're trying to use it to probe whether a resource maps to another > >>>> device. This is not what it does. request_resource() will happily let > >>>> you claim any resource you want, so long as nobody else claimed it > >>>> first. So the only case where the assumptions in this patch are valid > >>>> is if we can guarantee that any potentially conflicting device has a > >>>> driver loaded that has claimed those resources. As it is here, > >>>> vfio-pci will happily attempt to request resources that might overlap > >>>> with another device and might break drivers that haven't yet had a > >>>> chance to probe their devices. I don't think that's acceptable. > >>>> Thanks, > >>>> > >>>> Alex > >>>> > >>> I'm trying to get your point. Let me give an example here. > >>> I'm not sure whether my understanding is right. Please > >>> point it out if I'm wrong. > >>> > >>> We assume that there are two PCI devices like this: > >>> > >>> 240000000000-24feffffffff : /pciex@3fffe40400000 > >>> 240000000000-2400ffffffff : PCI Bus 0002:01 > >>> 240000000000-240000007fff : 0002:01:00.0 > >>> 240000000000-240000007fff : vfio-pci > >>> 240000008000-24000000ffff : 0002:01:01.0 > >>> 240000008000-24000000ffff : lpfc > >>> > >>> Do you mean vfio-pci driver will succeed in requesting > >>> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K) > >>> if it is loaded before lpfc driver? Like this: > >>> > >>> 240000000000-24feffffffff : /pciex@3fffe40400000 > >>> 240000000000-2400ffffffff : PCI Bus 0002:01 > >>> 240000000000-240000007fff : 0002:01:00.0 > >>> 240000000000-240000007fff : vfio-pci > >>> 240000008000-24000000ffff : 0002:01:01.0 > >>> 240000008000-24000000ffff : <BAD> --> vfio-pci call > >>> request_resource() > >>> > >>> Then lpfc driver will fail when it attempts to call > >>> pci_request_regions() later. > >> Yes, that is my supposition. > >> > >>> But is it possible that the dummy_res become the child of > >>> the res: 0002:01:01.0? Wouldn't request_resource() fail when > >>> it found parent res: PCI Bus 0002:01 already have conflict > >>> child res: 0002:01:01.0. > >>> > >>> And I think the case that request_resource() will succeed > >>> should like this: > >>> > >>> 240000000000-24feffffffff : /pciex@3fffe40400000 > >>> 240000000000-2400ffffffff : PCI Bus 0002:01 > >>> 240000000000-240000007fff : 0002:01:00.0 > >>> 240000010000-240000017fff : 0002:01:01.0 > >>> > >>> There is a mem hole: [240000008000-24000000ffff] after > >>> PCI probing. After loading drivers, the resources tree > >>> will be: > >>> > >>> 240000000000-24feffffffff : /pciex@3fffe40400000 > >>> 240000000000-2400ffffffff : PCI Bus 0002:01 > >>> 240000000000-240000007fff : 0002:01:00.0 > >>> 240000000000-240000007fff : vfio-pci > >>> 240000008000-24000000ffff : <BAD> ---> vfio-pci call > >>> request_resource() > >>> 240000010000-240000017fff : 0002:01:01.0 > >>> 240000010000-240000017fff : lpfc > >> Ok, let's stop guessing about this. I modified your patch as follows > >> so I could easily test it on a 4k host: > >> > >> --- a/drivers/vfio/pci/vfio_pci.c > >> +++ b/drivers/vfio/pci/vfio_pci.c > >> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) > >> return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; > >> } > >> > >> +#define VFIO_64K_PAGE_SIZE (64*1024) > >> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1)) > >> + > >> static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > >> { > >> struct resource *res; > >> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > >> if (!resource_size(res)) > >> goto no_mmap; > >> > >> - if (resource_size(res) >= PAGE_SIZE) { > >> + if (resource_size(res) >= VFIO_64K_PAGE_SIZE) { > >> vdev->bar_mmap_supported[bar] = true; > >> continue; > >> } > >> > >> - if (!(res->start & ~PAGE_MASK)) { > >> + if (!(res->start & ~VFIO_64K_PAGE_MASK)) { > >> + int ret; > >> /* > >> * Add a dummy resource to reserve the remainder > >> * of the exclusive page in case that hot-add > >> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > >> goto no_mmap; > >> > >> dummy_res->resource.start = res->end + 1; > >> - dummy_res->resource.end = res->start + PAGE_SIZE - 1; > >> + dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1; > >> dummy_res->resource.flags = res->flags; > >> - if (request_resource(res->parent, > >> - &dummy_res->resource)) { > >> + ret = request_resource(res->parent, > >> + &dummy_res->resource); > >> + if (ret) { > >> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret); > >> kfree(dummy_res); > >> goto no_mmap; > >> } > >> > >> IOW, enforce 64k for mmap regardless of PAGE_SIZE. Then I find a > >> system configuration to test it: > >> > >> ee400000-ef4fffff : PCI Bus 0000:07 > >> ef480000-ef49ffff : 0000:07:00.0 > >> ef480000-ef483fff : 0000:08:10.0 > >> ef484000-ef487fff : 0000:08:10.2 > >> ef488000-ef48bfff : 0000:08:10.4 > >> ef48c000-ef48ffff : 0000:08:10.6 > >> ef490000-ef493fff : 0000:08:11.0 > >> ef494000-ef497fff : 0000:08:11.2 > >> ef498000-ef49bfff : 0000:08:11.4 > >> ef4a0000-ef4bffff : 0000:07:00.0 > >> ef4a0000-ef4a3fff : 0000:08:10.0 > >> ef4a4000-ef4a7fff : 0000:08:10.2 > >> ef4a8000-ef4abfff : 0000:08:10.4 > >> ef4ac000-ef4affff : 0000:08:10.6 > >> ef4b0000-ef4b3fff : 0000:08:11.0 > >> ef4b4000-ef4b7fff : 0000:08:11.2 > >> ef4b8000-ef4bbfff : 0000:08:11.4 > >> > >> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all > >> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a > >> separate contiguous range. The igbvf driver is not loaded, so the > >> other resources are free to be claimed, what happens? > >> > >> It looks like you're right, the request_resource() fails with: > >> > >> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16) > >> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16) > >> > >> So we get back -EBUSY which means we hit a conflict. I would have > >> expected that this means our res->parent that we're using for > >> request_resource() is only, for instance, ef480000-ef483fff (ie. the > >> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of > >> the parent, but adding the parent resource range to the dev_info(), it > >> actually shows the range being ef480000-ef49ffff, so the parent is > >> actually the 07:00.0 resource. In fact, we can't even use > >> request_resource() like this to claim the BAR itself, which is why we > >> use pci_request_selected_regions(), which allows conflicts, putting the > >> requested resource at the leaf of the tree. > >> > >> So I guess I retract this concern about the use of request_resource(), > >> it does seem to behave as intended. Unless I can spot anything else or > >> other reviewers have comments, I'll queue this into my next branch for > >> v4.8. Thanks, > > > > Ok, one more test, I found that I have access to the following USB > > devices: > > > > 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI]) > > Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K] > > > > 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI]) > > Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K] > > > > These are nicely mapped such that vfio-pci can claim the residual space > > from the page, which results in the following in /proc/iomem: > > > > f7a07000-f7a073ff : 0000:00:1d.0 > > f7a07000-f7a073ff : vfio > > f7a07400-f7a07fff : <BAD> > > f7a08000-f7a083ff : 0000:00:1a.0 > > f7a08000-f7a083ff : vfio > > f7a08400-f7a08fff : <BAD> > > > > I should have looked more closely at your previous reply, I didn't > > think that "<BAD>" was literally the owner of these dummy resources. > > Please fix this to report something that isn't going frighten users > > and make small children cry. Thanks, > > Yeah, I also noticed that:-). Now I'm trying to find a proper > name. What do you think about "vfio-pci (dummy)"? How about "vfio sub-page reserved"? Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2016/6/30 10:53, Alex Williamson wrote: > On Thu, 30 Jun 2016 10:40:23 +0800 > Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: > >> Hi Alex, >> >> On 2016/6/30 4:03, Alex Williamson wrote: >> >>> On Tue, 28 Jun 2016 13:47:23 -0600 >>> Alex Williamson <alex.williamson@redhat.com> wrote: >>> >>>> On Tue, 28 Jun 2016 18:09:46 +0800 >>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: >>>> >>>>> Hi, Alex >>>>> >>>>> On 2016/6/25 0:43, Alex Williamson wrote: >>>>> >>>>>> On Fri, 24 Jun 2016 23:37:02 +0800 >>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: >>>>>> >>>>>>> Hi, Alex >>>>>>> >>>>>>> On 2016/6/24 11:37, Alex Williamson wrote: >>>>>>> >>>>>>>> On Fri, 24 Jun 2016 10:52:58 +0800 >>>>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: >>>>>>>>> On 2016/6/24 0:12, Alex Williamson wrote: >>>>>>>>>> On Mon, 30 May 2016 21:06:37 +0800 >>>>>>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote: >>>>>>>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) >>>>>>>>>>> +{ >>>>>>>>>>> + struct resource *res; >>>>>>>>>>> + int bar; >>>>>>>>>>> + struct vfio_pci_dummy_resource *dummy_res; >>>>>>>>>>> + >>>>>>>>>>> + INIT_LIST_HEAD(&vdev->dummy_resources_list); >>>>>>>>>>> + >>>>>>>>>>> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) { >>>>>>>>>>> + res = vdev->pdev->resource + bar; >>>>>>>>>>> + >>>>>>>>>>> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP)) >>>>>>>>>>> + goto no_mmap; >>>>>>>>>>> + >>>>>>>>>>> + if (!(res->flags & IORESOURCE_MEM)) >>>>>>>>>>> + goto no_mmap; >>>>>>>>>>> + >>>>>>>>>>> + /* >>>>>>>>>>> + * The PCI core shouldn't set up a resource with a >>>>>>>>>>> + * type but zero size. But there may be bugs that >>>>>>>>>>> + * cause us to do that. >>>>>>>>>>> + */ >>>>>>>>>>> + if (!resource_size(res)) >>>>>>>>>>> + goto no_mmap; >>>>>>>>>>> + >>>>>>>>>>> + if (resource_size(res) >= PAGE_SIZE) { >>>>>>>>>>> + vdev->bar_mmap_supported[bar] = true; >>>>>>>>>>> + continue; >>>>>>>>>>> + } >>>>>>>>>>> + >>>>>>>>>>> + if (!(res->start & ~PAGE_MASK)) { >>>>>>>>>>> + /* >>>>>>>>>>> + * Add a dummy resource to reserve the remainder >>>>>>>>>>> + * of the exclusive page in case that hot-add >>>>>>>>>>> + * device's bar is assigned into it. >>>>>>>>>>> + */ >>>>>>>>>>> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL); >>>>>>>>>>> + if (dummy_res == NULL) >>>>>>>>>>> + goto no_mmap; >>>>>>>>>>> + >>>>>>>>>>> + dummy_res->resource.start = res->end + 1; >>>>>>>>>>> + dummy_res->resource.end = res->start + PAGE_SIZE - 1; >>>>>>>>>>> + dummy_res->resource.flags = res->flags; >>>>>>>>>>> + if (request_resource(res->parent, >>>>>>>>>>> + &dummy_res->resource)) { >>>>>>>>>>> + kfree(dummy_res); >>>>>>>>>>> + goto no_mmap; >>>>>>>>>>> + } >>>>>>>>>> Isn't it true that request_resource() only tells us that at a given >>>>>>>>>> point in time, no other drivers have reserved that resource? It seems >>>>>>>>>> like it does not guarantee that the resource isn't routed to another >>>>>>>>>> device or that another driver won't at some point attempt to request >>>>>>>>>> that same resource. So for example if a user constructs their initrd >>>>>>>>>> to bind vfio-pci to devices before other modules load, this >>>>>>>>>> request_resource() may succeed, at the expense of drivers loaded later >>>>>>>>>> now failing. The behavior will depend on driver load order and we're >>>>>>>>>> not actually insuring that the overflow resource is unused, just that >>>>>>>>>> we got it first. Can we do better? Am I missing something that >>>>>>>>>> prevents this? Thanks, >>>>>>>>>> >>>>>>>>>> Alex >>>>>>>>> Couldn't PCI resources allocator prevent this, which will find a >>>>>>>>> empty slot in the resource tree firstly, then try to request that >>>>>>>>> resource in allocate_resource() when a PCI device is probed. >>>>>>>>> And I'd like to know why a PCI device driver would attempt to >>>>>>>>> call request_resource()? Should this be done in PCI enumeration? >>>>>>>> Hi Yongji, >>>>>>>> >>>>>>>> Looks like most pci drivers call pci_request_regions(). From there the >>>>>>>> call path is: >>>>>>>> >>>>>>>> pci_request_selected_regions >>>>>>>> __pci_request_selected_regions >>>>>>>> __pci_request_region >>>>>>>> __request_mem_region >>>>>>>> __request_region >>>>>>>> __request_resource >>>>>>>> >>>>>>>> We see this driver ordering issue sometimes with users attempting to >>>>>>>> blacklist native pci drivers, trying to leave a device free for use by >>>>>>>> vfio-pci. If the device is a graphics card, the generic vesa or uefi >>>>>>>> driver can request device resources causing a failure when vfio-pci >>>>>>>> tries to request those same resources. I expect that unless it's a >>>>>>>> boot device, like vga in my example, the resources are not enabled >>>>>>>> until the driver opens the device, therefore the request_resource() call >>>>>>>> doesn't occur until that point. >>>>>>>> >>>>>>>> For another trivial example, look at /proc/iomem as you load and unload >>>>>>>> a driver, on my laptop with e1000e unloaded I see: >>>>>>>> >>>>>>>> e1200000-e121ffff : 0000:00:19.0 >>>>>>>> e123e000-e123efff : 0000:00:19.0 >>>>>>>> >>>>>>>> When e1000e is loaded, each of these becomes claimed by the e1000e >>>>>>>> driver: >>>>>>>> >>>>>>>> e1200000-e121ffff : 0000:00:19.0 >>>>>>>> e1200000-e121ffff : e1000e >>>>>>>> e123e000-e123efff : 0000:00:19.0 >>>>>>>> e123e000-e123efff : e1000e >>>>>>>> >>>>>>>> Clearly pci core knows the resource is associated with the device, but >>>>>>>> I don't think we're tapping into that with request_resource(), we're >>>>>>>> just potentially stealing resources that another driver might have >>>>>>>> claimed otherwise as I described above. That's my suspicion at >>>>>>>> least, feel free to show otherwise if it's incorrect. Thanks, >>>>>>>> >>>>>>>> Alex >>>>>>>> >>>>>>> Thanks for your explanation. But I still have one question. >>>>>>> Shouldn't PCI core have claimed all PCI device's resources >>>>>>> after probing those devices. If so, request_resource() will fail >>>>>>> when vfio-pci try to steal resources that another driver might >>>>>>> request later. Anything I missed here? Some device resources >>>>>>> would not be claimed in PCI core? >>>>>> Hi Yongji, >>>>>> >>>>>> I don't know what to say, this is not how the interface currently >>>>>> works. request_resource() is a driver level interface that tries to >>>>>> prevent drivers from claiming conflicting resources. In this patch >>>>>> you're trying to use it to probe whether a resource maps to another >>>>>> device. This is not what it does. request_resource() will happily let >>>>>> you claim any resource you want, so long as nobody else claimed it >>>>>> first. So the only case where the assumptions in this patch are valid >>>>>> is if we can guarantee that any potentially conflicting device has a >>>>>> driver loaded that has claimed those resources. As it is here, >>>>>> vfio-pci will happily attempt to request resources that might overlap >>>>>> with another device and might break drivers that haven't yet had a >>>>>> chance to probe their devices. I don't think that's acceptable. >>>>>> Thanks, >>>>>> >>>>>> Alex >>>>>> >>>>> I'm trying to get your point. Let me give an example here. >>>>> I'm not sure whether my understanding is right. Please >>>>> point it out if I'm wrong. >>>>> >>>>> We assume that there are two PCI devices like this: >>>>> >>>>> 240000000000-24feffffffff : /pciex@3fffe40400000 >>>>> 240000000000-2400ffffffff : PCI Bus 0002:01 >>>>> 240000000000-240000007fff : 0002:01:00.0 >>>>> 240000000000-240000007fff : vfio-pci >>>>> 240000008000-24000000ffff : 0002:01:01.0 >>>>> 240000008000-24000000ffff : lpfc >>>>> >>>>> Do you mean vfio-pci driver will succeed in requesting >>>>> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K) >>>>> if it is loaded before lpfc driver? Like this: >>>>> >>>>> 240000000000-24feffffffff : /pciex@3fffe40400000 >>>>> 240000000000-2400ffffffff : PCI Bus 0002:01 >>>>> 240000000000-240000007fff : 0002:01:00.0 >>>>> 240000000000-240000007fff : vfio-pci >>>>> 240000008000-24000000ffff : 0002:01:01.0 >>>>> 240000008000-24000000ffff : <BAD> --> vfio-pci call >>>>> request_resource() >>>>> >>>>> Then lpfc driver will fail when it attempts to call >>>>> pci_request_regions() later. >>>> Yes, that is my supposition. >>>> >>>>> But is it possible that the dummy_res become the child of >>>>> the res: 0002:01:01.0? Wouldn't request_resource() fail when >>>>> it found parent res: PCI Bus 0002:01 already have conflict >>>>> child res: 0002:01:01.0. >>>>> >>>>> And I think the case that request_resource() will succeed >>>>> should like this: >>>>> >>>>> 240000000000-24feffffffff : /pciex@3fffe40400000 >>>>> 240000000000-2400ffffffff : PCI Bus 0002:01 >>>>> 240000000000-240000007fff : 0002:01:00.0 >>>>> 240000010000-240000017fff : 0002:01:01.0 >>>>> >>>>> There is a mem hole: [240000008000-24000000ffff] after >>>>> PCI probing. After loading drivers, the resources tree >>>>> will be: >>>>> >>>>> 240000000000-24feffffffff : /pciex@3fffe40400000 >>>>> 240000000000-2400ffffffff : PCI Bus 0002:01 >>>>> 240000000000-240000007fff : 0002:01:00.0 >>>>> 240000000000-240000007fff : vfio-pci >>>>> 240000008000-24000000ffff : <BAD> ---> vfio-pci call >>>>> request_resource() >>>>> 240000010000-240000017fff : 0002:01:01.0 >>>>> 240000010000-240000017fff : lpfc >>>> Ok, let's stop guessing about this. I modified your patch as follows >>>> so I could easily test it on a 4k host: >>>> >>>> --- a/drivers/vfio/pci/vfio_pci.c >>>> +++ b/drivers/vfio/pci/vfio_pci.c >>>> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) >>>> return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; >>>> } >>>> >>>> +#define VFIO_64K_PAGE_SIZE (64*1024) >>>> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1)) >>>> + >>>> static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) >>>> { >>>> struct resource *res; >>>> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) >>>> if (!resource_size(res)) >>>> goto no_mmap; >>>> >>>> - if (resource_size(res) >= PAGE_SIZE) { >>>> + if (resource_size(res) >= VFIO_64K_PAGE_SIZE) { >>>> vdev->bar_mmap_supported[bar] = true; >>>> continue; >>>> } >>>> >>>> - if (!(res->start & ~PAGE_MASK)) { >>>> + if (!(res->start & ~VFIO_64K_PAGE_MASK)) { >>>> + int ret; >>>> /* >>>> * Add a dummy resource to reserve the remainder >>>> * of the exclusive page in case that hot-add >>>> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) >>>> goto no_mmap; >>>> >>>> dummy_res->resource.start = res->end + 1; >>>> - dummy_res->resource.end = res->start + PAGE_SIZE - 1; >>>> + dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1; >>>> dummy_res->resource.flags = res->flags; >>>> - if (request_resource(res->parent, >>>> - &dummy_res->resource)) { >>>> + ret = request_resource(res->parent, >>>> + &dummy_res->resource); >>>> + if (ret) { >>>> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret); >>>> kfree(dummy_res); >>>> goto no_mmap; >>>> } >>>> >>>> IOW, enforce 64k for mmap regardless of PAGE_SIZE. Then I find a >>>> system configuration to test it: >>>> >>>> ee400000-ef4fffff : PCI Bus 0000:07 >>>> ef480000-ef49ffff : 0000:07:00.0 >>>> ef480000-ef483fff : 0000:08:10.0 >>>> ef484000-ef487fff : 0000:08:10.2 >>>> ef488000-ef48bfff : 0000:08:10.4 >>>> ef48c000-ef48ffff : 0000:08:10.6 >>>> ef490000-ef493fff : 0000:08:11.0 >>>> ef494000-ef497fff : 0000:08:11.2 >>>> ef498000-ef49bfff : 0000:08:11.4 >>>> ef4a0000-ef4bffff : 0000:07:00.0 >>>> ef4a0000-ef4a3fff : 0000:08:10.0 >>>> ef4a4000-ef4a7fff : 0000:08:10.2 >>>> ef4a8000-ef4abfff : 0000:08:10.4 >>>> ef4ac000-ef4affff : 0000:08:10.6 >>>> ef4b0000-ef4b3fff : 0000:08:11.0 >>>> ef4b4000-ef4b7fff : 0000:08:11.2 >>>> ef4b8000-ef4bbfff : 0000:08:11.4 >>>> >>>> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all >>>> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a >>>> separate contiguous range. The igbvf driver is not loaded, so the >>>> other resources are free to be claimed, what happens? >>>> >>>> It looks like you're right, the request_resource() fails with: >>>> >>>> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16) >>>> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16) >>>> >>>> So we get back -EBUSY which means we hit a conflict. I would have >>>> expected that this means our res->parent that we're using for >>>> request_resource() is only, for instance, ef480000-ef483fff (ie. the >>>> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of >>>> the parent, but adding the parent resource range to the dev_info(), it >>>> actually shows the range being ef480000-ef49ffff, so the parent is >>>> actually the 07:00.0 resource. In fact, we can't even use >>>> request_resource() like this to claim the BAR itself, which is why we >>>> use pci_request_selected_regions(), which allows conflicts, putting the >>>> requested resource at the leaf of the tree. >>>> >>>> So I guess I retract this concern about the use of request_resource(), >>>> it does seem to behave as intended. Unless I can spot anything else or >>>> other reviewers have comments, I'll queue this into my next branch for >>>> v4.8. Thanks, >>> Ok, one more test, I found that I have access to the following USB >>> devices: >>> >>> 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI]) >>> Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K] >>> >>> 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI]) >>> Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K] >>> >>> These are nicely mapped such that vfio-pci can claim the residual space >>> from the page, which results in the following in /proc/iomem: >>> >>> f7a07000-f7a073ff : 0000:00:1d.0 >>> f7a07000-f7a073ff : vfio >>> f7a07400-f7a07fff : <BAD> >>> f7a08000-f7a083ff : 0000:00:1a.0 >>> f7a08000-f7a083ff : vfio >>> f7a08400-f7a08fff : <BAD> >>> >>> I should have looked more closely at your previous reply, I didn't >>> think that "<BAD>" was literally the owner of these dummy resources. >>> Please fix this to report something that isn't going frighten users >>> and make small children cry. Thanks, >> Yeah, I also noticed that:-). Now I'm trying to find a proper >> name. What do you think about "vfio-pci (dummy)"? > How about "vfio sub-page reserved"? Thanks, Sounds good. I'll send a new version soon. Thanks, Yongji -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; } +#define VFIO_64K_PAGE_SIZE (64*1024) +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1)) + static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) { struct resource *res; @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) if (!resource_size(res)) goto no_mmap; - if (resource_size(res) >= PAGE_SIZE) { + if (resource_size(res) >= VFIO_64K_PAGE_SIZE) { vdev->bar_mmap_supported[bar] = true; continue; } - if (!(res->start & ~PAGE_MASK)) { + if (!(res->start & ~VFIO_64K_PAGE_MASK)) { + int ret; /* * Add a dummy resource to reserve the remainder * of the exclusive page in case that hot-add @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) goto no_mmap; dummy_res->resource.start = res->end + 1; - dummy_res->resource.end = res->start + PAGE_SIZE - 1; + dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1; dummy_res->resource.flags = res->flags; - if (request_resource(res->parent, - &dummy_res->resource)) { + ret = request_resource(res->parent, + &dummy_res->resource); + if (ret) { +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret); kfree(dummy_res); goto no_mmap; }