diff mbox

[v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

Message ID 20160628134723.7ffefcf3@t450s.home (mailing list archive)
State New, archived
Headers show

Commit Message

Alex Williamson June 28, 2016, 7:47 p.m. UTC
On Tue, 28 Jun 2016 18:09:46 +0800
Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:

> Hi, Alex
> 
> On 2016/6/25 0:43, Alex Williamson wrote:
> 
> > On Fri, 24 Jun 2016 23:37:02 +0800
> > Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
> >  
> >> Hi, Alex
> >>
> >> On 2016/6/24 11:37, Alex Williamson wrote:
> >>  
> >>> On Fri, 24 Jun 2016 10:52:58 +0800
> >>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:  
> >>>> On 2016/6/24 0:12, Alex Williamson wrote:  
> >>>>> On Mon, 30 May 2016 21:06:37 +0800
> >>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:  
> >>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >>>>>> +{
> >>>>>> +	struct resource *res;
> >>>>>> +	int bar;
> >>>>>> +	struct vfio_pci_dummy_resource *dummy_res;
> >>>>>> +
> >>>>>> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> >>>>>> +
> >>>>>> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> >>>>>> +		res = vdev->pdev->resource + bar;
> >>>>>> +
> >>>>>> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> >>>>>> +			goto no_mmap;
> >>>>>> +
> >>>>>> +		if (!(res->flags & IORESOURCE_MEM))
> >>>>>> +			goto no_mmap;
> >>>>>> +
> >>>>>> +		/*
> >>>>>> +		 * The PCI core shouldn't set up a resource with a
> >>>>>> +		 * type but zero size. But there may be bugs that
> >>>>>> +		 * cause us to do that.
> >>>>>> +		 */
> >>>>>> +		if (!resource_size(res))
> >>>>>> +			goto no_mmap;
> >>>>>> +
> >>>>>> +		if (resource_size(res) >= PAGE_SIZE) {
> >>>>>> +			vdev->bar_mmap_supported[bar] = true;
> >>>>>> +			continue;
> >>>>>> +		}
> >>>>>> +
> >>>>>> +		if (!(res->start & ~PAGE_MASK)) {
> >>>>>> +			/*
> >>>>>> +			 * Add a dummy resource to reserve the remainder
> >>>>>> +			 * of the exclusive page in case that hot-add
> >>>>>> +			 * device's bar is assigned into it.
> >>>>>> +			 */
> >>>>>> +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> >>>>>> +			if (dummy_res == NULL)
> >>>>>> +				goto no_mmap;
> >>>>>> +
> >>>>>> +			dummy_res->resource.start = res->end + 1;
> >>>>>> +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> >>>>>> +			dummy_res->resource.flags = res->flags;
> >>>>>> +			if (request_resource(res->parent,
> >>>>>> +						&dummy_res->resource)) {
> >>>>>> +				kfree(dummy_res);
> >>>>>> +				goto no_mmap;
> >>>>>> +			}  
> >>>>> Isn't it true that request_resource() only tells us that at a given
> >>>>> point in time, no other drivers have reserved that resource?  It seems
> >>>>> like it does not guarantee that the resource isn't routed to another
> >>>>> device or that another driver won't at some point attempt to request
> >>>>> that same resource.  So for example if a user constructs their initrd
> >>>>> to bind vfio-pci to devices before other modules load, this
> >>>>> request_resource() may succeed, at the expense of drivers loaded later
> >>>>> now failing.  The behavior will depend on driver load order and we're
> >>>>> not actually insuring that the overflow resource is unused, just that
> >>>>> we got it first.  Can we do better?  Am I missing something that
> >>>>> prevents this?  Thanks,
> >>>>>
> >>>>> Alex  
> >>>> Couldn't PCI resources allocator prevent this, which will find a
> >>>> empty slot in the resource tree firstly, then try to request that
> >>>> resource in allocate_resource() when a PCI device is probed.
> >>>> And I'd like to know why a PCI device driver would attempt to
> >>>> call request_resource()? Should this be done in PCI enumeration?  
> >>> Hi Yongji,
> >>>
> >>> Looks like most pci drivers call pci_request_regions().  From there the
> >>> call path is:
> >>>
> >>> pci_request_selected_regions
> >>>     __pci_request_selected_regions
> >>>       __pci_request_region
> >>>         __request_mem_region
> >>>           __request_region
> >>>             __request_resource
> >>>
> >>> We see this driver ordering issue sometimes with users attempting to
> >>> blacklist native pci drivers, trying to leave a device free for use by
> >>> vfio-pci.  If the device is a graphics card, the generic vesa or uefi
> >>> driver can request device resources causing a failure when vfio-pci
> >>> tries to request those same resources.  I expect that unless it's a
> >>> boot device, like vga in my example, the resources are not enabled
> >>> until the driver opens the device, therefore the request_resource() call
> >>> doesn't occur until that point.
> >>>
> >>> For another trivial example, look at /proc/iomem as you load and unload
> >>> a driver, on my laptop with e1000e unloaded I see:
> >>>
> >>>     e1200000-e121ffff : 0000:00:19.0
> >>>     e123e000-e123efff : 0000:00:19.0
> >>>
> >>> When e1000e is loaded, each of these becomes claimed by the e1000e
> >>> driver:
> >>>
> >>>     e1200000-e121ffff : 0000:00:19.0
> >>>       e1200000-e121ffff : e1000e
> >>>     e123e000-e123efff : 0000:00:19.0
> >>>       e123e000-e123efff : e1000e
> >>>
> >>> Clearly pci core knows the resource is associated with the device, but
> >>> I don't think we're tapping into that with request_resource(), we're
> >>> just potentially stealing resources that another driver might have
> >>> claimed otherwise as I described above.  That's my suspicion at
> >>> least, feel free to show otherwise if it's incorrect.  Thanks,
> >>>
> >>> Alex
> >>>     
> >> Thanks for your explanation. But I still have one question.
> >> Shouldn't PCI core have claimed all PCI device's resources
> >> after probing those devices. If so, request_resource() will fail
> >> when vfio-pci try to steal resources that another driver might
> >> request later. Anything I missed here?  Some device resources
> >> would not be claimed in PCI core?  
> > Hi Yongji,
> >
> > I don't know what to say, this is not how the interface currently
> > works.  request_resource() is a driver level interface that tries to
> > prevent drivers from claiming conflicting resources.  In this patch
> > you're trying to use it to probe whether a resource maps to another
> > device.  This is not what it does.  request_resource() will happily let
> > you claim any resource you want, so long as nobody else claimed it
> > first.  So the only case where the assumptions in this patch are valid
> > is if we can guarantee that any potentially conflicting device has a
> > driver loaded that has claimed those resources.  As it is here,
> > vfio-pci will happily attempt to request resources that might overlap
> > with another device and might break drivers that haven't yet had a
> > chance to probe their devices.  I don't think that's acceptable.
> > Thanks,
> >
> > Alex
> >  
> 
> I'm trying to get your point. Let me give an example here.
> I'm not sure whether my understanding is right. Please
> point it out if I'm wrong.
> 
> We assume that there are two PCI devices like this:
> 
> 240000000000-24feffffffff : /pciex@3fffe40400000
>    240000000000-2400ffffffff : PCI Bus 0002:01
>      240000000000-240000007fff : 0002:01:00.0
>        240000000000-240000007fff : vfio-pci
>      240000008000-24000000ffff : 0002:01:01.0
>        240000008000-24000000ffff : lpfc
> 
> Do you mean vfio-pci driver will succeed in requesting
> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K)
> if it is loaded before lpfc driver? Like this:
> 
> 240000000000-24feffffffff : /pciex@3fffe40400000
>    240000000000-2400ffffffff : PCI Bus 0002:01
>      240000000000-240000007fff : 0002:01:00.0
>        240000000000-240000007fff : vfio-pci
>      240000008000-24000000ffff : 0002:01:01.0
>        240000008000-24000000ffff : <BAD>    --> vfio-pci call 
> request_resource()
> 
> Then lpfc driver will fail when it attempts to call
> pci_request_regions() later.

Yes, that is my supposition.
 
> But is it possible that the dummy_res become the child of
> the res: 0002:01:01.0? Wouldn't request_resource() fail when
> it found parent res: PCI Bus 0002:01 already have conflict
> child res: 0002:01:01.0.
> 
> And I think the case that request_resource() will succeed
> should like this:
> 
> 240000000000-24feffffffff : /pciex@3fffe40400000
>    240000000000-2400ffffffff : PCI Bus 0002:01
>      240000000000-240000007fff : 0002:01:00.0
>      240000010000-240000017fff : 0002:01:01.0
> 
> There is a mem hole: [240000008000-24000000ffff] after
> PCI probing.  After loading drivers, the resources tree
> will be:
> 
> 240000000000-24feffffffff : /pciex@3fffe40400000
>    240000000000-2400ffffffff : PCI Bus 0002:01
>      240000000000-240000007fff : 0002:01:00.0
>        240000000000-240000007fff : vfio-pci
>      240000008000-24000000ffff : <BAD>    ---> vfio-pci call 
> request_resource()
>      240000010000-240000017fff : 0002:01:01.0
>        240000010000-240000017fff : lpfc

Ok, let's stop guessing about this.  I modified your patch as follows
so I could easily test it on a 4k host:


IOW, enforce 64k for mmap regardless of PAGE_SIZE.  Then I find a
system configuration to test it:

  ee400000-ef4fffff : PCI Bus 0000:07
    ef480000-ef49ffff : 0000:07:00.0
      ef480000-ef483fff : 0000:08:10.0
      ef484000-ef487fff : 0000:08:10.2
      ef488000-ef48bfff : 0000:08:10.4
      ef48c000-ef48ffff : 0000:08:10.6
      ef490000-ef493fff : 0000:08:11.0
      ef494000-ef497fff : 0000:08:11.2
      ef498000-ef49bfff : 0000:08:11.4
    ef4a0000-ef4bffff : 0000:07:00.0
      ef4a0000-ef4a3fff : 0000:08:10.0
      ef4a4000-ef4a7fff : 0000:08:10.2
      ef4a8000-ef4abfff : 0000:08:10.4
      ef4ac000-ef4affff : 0000:08:10.6
      ef4b0000-ef4b3fff : 0000:08:11.0
      ef4b4000-ef4b7fff : 0000:08:11.2
      ef4b8000-ef4bbfff : 0000:08:11.4

This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all
the VF BAR0s are in a contiguous range and all the VF BAR3s are in a
separate contiguous range.  The igbvf driver is not loaded, so the
other resources are free to be claimed, what happens?

It looks like you're right, the request_resource() fails with:

vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)

So we get back -EBUSY which means we hit a conflict.  I would have
expected that this means our res->parent that we're using for
request_resource() is only, for instance, ef480000-ef483fff (ie. the
BAR itself) thus our request for ef484000-ef48ffff exceeds the end of
the parent, but adding the parent resource range to the dev_info(), it
actually shows the range being ef480000-ef49ffff, so the parent is
actually the 07:00.0 resource.  In fact, we can't even use
request_resource() like this to claim the BAR itself, which is why we
use pci_request_selected_regions(), which allows conflicts, putting the
requested resource at the leaf of the tree.

So I guess I retract this concern about the use of request_resource(),
it does seem to behave as intended.  Unless I can spot anything else or
other reviewers have comments, I'll queue this into my next branch for
v4.8.  Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Alex Williamson June 29, 2016, 8:03 p.m. UTC | #1
On Tue, 28 Jun 2016 13:47:23 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 28 Jun 2016 18:09:46 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
> 
> > Hi, Alex
> > 
> > On 2016/6/25 0:43, Alex Williamson wrote:
> >   
> > > On Fri, 24 Jun 2016 23:37:02 +0800
> > > Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
> > >    
> > >> Hi, Alex
> > >>
> > >> On 2016/6/24 11:37, Alex Williamson wrote:
> > >>    
> > >>> On Fri, 24 Jun 2016 10:52:58 +0800
> > >>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:    
> > >>>> On 2016/6/24 0:12, Alex Williamson wrote:    
> > >>>>> On Mon, 30 May 2016 21:06:37 +0800
> > >>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:    
> > >>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> > >>>>>> +{
> > >>>>>> +	struct resource *res;
> > >>>>>> +	int bar;
> > >>>>>> +	struct vfio_pci_dummy_resource *dummy_res;
> > >>>>>> +
> > >>>>>> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> > >>>>>> +
> > >>>>>> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> > >>>>>> +		res = vdev->pdev->resource + bar;
> > >>>>>> +
> > >>>>>> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> > >>>>>> +			goto no_mmap;
> > >>>>>> +
> > >>>>>> +		if (!(res->flags & IORESOURCE_MEM))
> > >>>>>> +			goto no_mmap;
> > >>>>>> +
> > >>>>>> +		/*
> > >>>>>> +		 * The PCI core shouldn't set up a resource with a
> > >>>>>> +		 * type but zero size. But there may be bugs that
> > >>>>>> +		 * cause us to do that.
> > >>>>>> +		 */
> > >>>>>> +		if (!resource_size(res))
> > >>>>>> +			goto no_mmap;
> > >>>>>> +
> > >>>>>> +		if (resource_size(res) >= PAGE_SIZE) {
> > >>>>>> +			vdev->bar_mmap_supported[bar] = true;
> > >>>>>> +			continue;
> > >>>>>> +		}
> > >>>>>> +
> > >>>>>> +		if (!(res->start & ~PAGE_MASK)) {
> > >>>>>> +			/*
> > >>>>>> +			 * Add a dummy resource to reserve the remainder
> > >>>>>> +			 * of the exclusive page in case that hot-add
> > >>>>>> +			 * device's bar is assigned into it.
> > >>>>>> +			 */
> > >>>>>> +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> > >>>>>> +			if (dummy_res == NULL)
> > >>>>>> +				goto no_mmap;
> > >>>>>> +
> > >>>>>> +			dummy_res->resource.start = res->end + 1;
> > >>>>>> +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> > >>>>>> +			dummy_res->resource.flags = res->flags;
> > >>>>>> +			if (request_resource(res->parent,
> > >>>>>> +						&dummy_res->resource)) {
> > >>>>>> +				kfree(dummy_res);
> > >>>>>> +				goto no_mmap;
> > >>>>>> +			}    
> > >>>>> Isn't it true that request_resource() only tells us that at a given
> > >>>>> point in time, no other drivers have reserved that resource?  It seems
> > >>>>> like it does not guarantee that the resource isn't routed to another
> > >>>>> device or that another driver won't at some point attempt to request
> > >>>>> that same resource.  So for example if a user constructs their initrd
> > >>>>> to bind vfio-pci to devices before other modules load, this
> > >>>>> request_resource() may succeed, at the expense of drivers loaded later
> > >>>>> now failing.  The behavior will depend on driver load order and we're
> > >>>>> not actually insuring that the overflow resource is unused, just that
> > >>>>> we got it first.  Can we do better?  Am I missing something that
> > >>>>> prevents this?  Thanks,
> > >>>>>
> > >>>>> Alex    
> > >>>> Couldn't PCI resources allocator prevent this, which will find a
> > >>>> empty slot in the resource tree firstly, then try to request that
> > >>>> resource in allocate_resource() when a PCI device is probed.
> > >>>> And I'd like to know why a PCI device driver would attempt to
> > >>>> call request_resource()? Should this be done in PCI enumeration?    
> > >>> Hi Yongji,
> > >>>
> > >>> Looks like most pci drivers call pci_request_regions().  From there the
> > >>> call path is:
> > >>>
> > >>> pci_request_selected_regions
> > >>>     __pci_request_selected_regions
> > >>>       __pci_request_region
> > >>>         __request_mem_region
> > >>>           __request_region
> > >>>             __request_resource
> > >>>
> > >>> We see this driver ordering issue sometimes with users attempting to
> > >>> blacklist native pci drivers, trying to leave a device free for use by
> > >>> vfio-pci.  If the device is a graphics card, the generic vesa or uefi
> > >>> driver can request device resources causing a failure when vfio-pci
> > >>> tries to request those same resources.  I expect that unless it's a
> > >>> boot device, like vga in my example, the resources are not enabled
> > >>> until the driver opens the device, therefore the request_resource() call
> > >>> doesn't occur until that point.
> > >>>
> > >>> For another trivial example, look at /proc/iomem as you load and unload
> > >>> a driver, on my laptop with e1000e unloaded I see:
> > >>>
> > >>>     e1200000-e121ffff : 0000:00:19.0
> > >>>     e123e000-e123efff : 0000:00:19.0
> > >>>
> > >>> When e1000e is loaded, each of these becomes claimed by the e1000e
> > >>> driver:
> > >>>
> > >>>     e1200000-e121ffff : 0000:00:19.0
> > >>>       e1200000-e121ffff : e1000e
> > >>>     e123e000-e123efff : 0000:00:19.0
> > >>>       e123e000-e123efff : e1000e
> > >>>
> > >>> Clearly pci core knows the resource is associated with the device, but
> > >>> I don't think we're tapping into that with request_resource(), we're
> > >>> just potentially stealing resources that another driver might have
> > >>> claimed otherwise as I described above.  That's my suspicion at
> > >>> least, feel free to show otherwise if it's incorrect.  Thanks,
> > >>>
> > >>> Alex
> > >>>       
> > >> Thanks for your explanation. But I still have one question.
> > >> Shouldn't PCI core have claimed all PCI device's resources
> > >> after probing those devices. If so, request_resource() will fail
> > >> when vfio-pci try to steal resources that another driver might
> > >> request later. Anything I missed here?  Some device resources
> > >> would not be claimed in PCI core?    
> > > Hi Yongji,
> > >
> > > I don't know what to say, this is not how the interface currently
> > > works.  request_resource() is a driver level interface that tries to
> > > prevent drivers from claiming conflicting resources.  In this patch
> > > you're trying to use it to probe whether a resource maps to another
> > > device.  This is not what it does.  request_resource() will happily let
> > > you claim any resource you want, so long as nobody else claimed it
> > > first.  So the only case where the assumptions in this patch are valid
> > > is if we can guarantee that any potentially conflicting device has a
> > > driver loaded that has claimed those resources.  As it is here,
> > > vfio-pci will happily attempt to request resources that might overlap
> > > with another device and might break drivers that haven't yet had a
> > > chance to probe their devices.  I don't think that's acceptable.
> > > Thanks,
> > >
> > > Alex
> > >    
> > 
> > I'm trying to get your point. Let me give an example here.
> > I'm not sure whether my understanding is right. Please
> > point it out if I'm wrong.
> > 
> > We assume that there are two PCI devices like this:
> > 
> > 240000000000-24feffffffff : /pciex@3fffe40400000
> >    240000000000-2400ffffffff : PCI Bus 0002:01
> >      240000000000-240000007fff : 0002:01:00.0
> >        240000000000-240000007fff : vfio-pci
> >      240000008000-24000000ffff : 0002:01:01.0
> >        240000008000-24000000ffff : lpfc
> > 
> > Do you mean vfio-pci driver will succeed in requesting
> > dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K)
> > if it is loaded before lpfc driver? Like this:
> > 
> > 240000000000-24feffffffff : /pciex@3fffe40400000
> >    240000000000-2400ffffffff : PCI Bus 0002:01
> >      240000000000-240000007fff : 0002:01:00.0
> >        240000000000-240000007fff : vfio-pci
> >      240000008000-24000000ffff : 0002:01:01.0
> >        240000008000-24000000ffff : <BAD>    --> vfio-pci call 
> > request_resource()
> > 
> > Then lpfc driver will fail when it attempts to call
> > pci_request_regions() later.  
> 
> Yes, that is my supposition.
>  
> > But is it possible that the dummy_res become the child of
> > the res: 0002:01:01.0? Wouldn't request_resource() fail when
> > it found parent res: PCI Bus 0002:01 already have conflict
> > child res: 0002:01:01.0.
> > 
> > And I think the case that request_resource() will succeed
> > should like this:
> > 
> > 240000000000-24feffffffff : /pciex@3fffe40400000
> >    240000000000-2400ffffffff : PCI Bus 0002:01
> >      240000000000-240000007fff : 0002:01:00.0
> >      240000010000-240000017fff : 0002:01:01.0
> > 
> > There is a mem hole: [240000008000-24000000ffff] after
> > PCI probing.  After loading drivers, the resources tree
> > will be:
> > 
> > 240000000000-24feffffffff : /pciex@3fffe40400000
> >    240000000000-2400ffffffff : PCI Bus 0002:01
> >      240000000000-240000007fff : 0002:01:00.0
> >        240000000000-240000007fff : vfio-pci
> >      240000008000-24000000ffff : <BAD>    ---> vfio-pci call 
> > request_resource()
> >      240000010000-240000017fff : 0002:01:01.0
> >        240000010000-240000017fff : lpfc  
> 
> Ok, let's stop guessing about this.  I modified your patch as follows
> so I could easily test it on a 4k host:
> 
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>  	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>  }
>  
> +#define VFIO_64K_PAGE_SIZE (64*1024)
> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
> +
>  static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>  {
>  	struct resource *res;
> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>  		if (!resource_size(res))
>  			goto no_mmap;
>  
> -		if (resource_size(res) >= PAGE_SIZE) {
> +		if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
>  			vdev->bar_mmap_supported[bar] = true;
>  			continue;
>  		}
>  
> -		if (!(res->start & ~PAGE_MASK)) {
> +		if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
> +			int ret;
>  			/*
>  			 * Add a dummy resource to reserve the remainder
>  			 * of the exclusive page in case that hot-add
> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>  				goto no_mmap;
>  
>  			dummy_res->resource.start = res->end + 1;
> -			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> +			dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
>  			dummy_res->resource.flags = res->flags;
> -			if (request_resource(res->parent,
> -						&dummy_res->resource)) {
> +			ret = request_resource(res->parent,
> +						&dummy_res->resource);
> +			if (ret) {
> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
>  				kfree(dummy_res);
>  				goto no_mmap;
>  			}
> 
> IOW, enforce 64k for mmap regardless of PAGE_SIZE.  Then I find a
> system configuration to test it:
> 
>   ee400000-ef4fffff : PCI Bus 0000:07
>     ef480000-ef49ffff : 0000:07:00.0
>       ef480000-ef483fff : 0000:08:10.0
>       ef484000-ef487fff : 0000:08:10.2
>       ef488000-ef48bfff : 0000:08:10.4
>       ef48c000-ef48ffff : 0000:08:10.6
>       ef490000-ef493fff : 0000:08:11.0
>       ef494000-ef497fff : 0000:08:11.2
>       ef498000-ef49bfff : 0000:08:11.4
>     ef4a0000-ef4bffff : 0000:07:00.0
>       ef4a0000-ef4a3fff : 0000:08:10.0
>       ef4a4000-ef4a7fff : 0000:08:10.2
>       ef4a8000-ef4abfff : 0000:08:10.4
>       ef4ac000-ef4affff : 0000:08:10.6
>       ef4b0000-ef4b3fff : 0000:08:11.0
>       ef4b4000-ef4b7fff : 0000:08:11.2
>       ef4b8000-ef4bbfff : 0000:08:11.4
> 
> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all
> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a
> separate contiguous range.  The igbvf driver is not loaded, so the
> other resources are free to be claimed, what happens?
> 
> It looks like you're right, the request_resource() fails with:
> 
> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)
> 
> So we get back -EBUSY which means we hit a conflict.  I would have
> expected that this means our res->parent that we're using for
> request_resource() is only, for instance, ef480000-ef483fff (ie. the
> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of
> the parent, but adding the parent resource range to the dev_info(), it
> actually shows the range being ef480000-ef49ffff, so the parent is
> actually the 07:00.0 resource.  In fact, we can't even use
> request_resource() like this to claim the BAR itself, which is why we
> use pci_request_selected_regions(), which allows conflicts, putting the
> requested resource at the leaf of the tree.
> 
> So I guess I retract this concern about the use of request_resource(),
> it does seem to behave as intended.  Unless I can spot anything else or
> other reviewers have comments, I'll queue this into my next branch for
> v4.8.  Thanks,


Ok, one more test, I found that I have access to the following USB
devices:

00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])
	Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K]

00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI])
	Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K]

These are nicely mapped such that vfio-pci can claim the residual space
from the page, which results in the following in /proc/iomem:

  f7a07000-f7a073ff : 0000:00:1d.0
    f7a07000-f7a073ff : vfio
  f7a07400-f7a07fff : <BAD>
  f7a08000-f7a083ff : 0000:00:1a.0
    f7a08000-f7a083ff : vfio
  f7a08400-f7a08fff : <BAD>

I should have looked more closely at your previous reply, I didn't
think that "<BAD>" was literally the owner of these dummy resources.
Please fix this to report something that isn't going frighten users
and make small children cry.  Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yongji Xie June 30, 2016, 2:40 a.m. UTC | #2
Hi Alex,

On 2016/6/30 4:03, Alex Williamson wrote:

> On Tue, 28 Jun 2016 13:47:23 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
>
>> On Tue, 28 Jun 2016 18:09:46 +0800
>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>
>>> Hi, Alex
>>>
>>> On 2016/6/25 0:43, Alex Williamson wrote:
>>>    
>>>> On Fri, 24 Jun 2016 23:37:02 +0800
>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>>>     
>>>>> Hi, Alex
>>>>>
>>>>> On 2016/6/24 11:37, Alex Williamson wrote:
>>>>>     
>>>>>> On Fri, 24 Jun 2016 10:52:58 +0800
>>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>>>>>> On 2016/6/24 0:12, Alex Williamson wrote:
>>>>>>>> On Mon, 30 May 2016 21:06:37 +0800
>>>>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>>>>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>>>>>>> +{
>>>>>>>>> +	struct resource *res;
>>>>>>>>> +	int bar;
>>>>>>>>> +	struct vfio_pci_dummy_resource *dummy_res;
>>>>>>>>> +
>>>>>>>>> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
>>>>>>>>> +
>>>>>>>>> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
>>>>>>>>> +		res = vdev->pdev->resource + bar;
>>>>>>>>> +
>>>>>>>>> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
>>>>>>>>> +			goto no_mmap;
>>>>>>>>> +
>>>>>>>>> +		if (!(res->flags & IORESOURCE_MEM))
>>>>>>>>> +			goto no_mmap;
>>>>>>>>> +
>>>>>>>>> +		/*
>>>>>>>>> +		 * The PCI core shouldn't set up a resource with a
>>>>>>>>> +		 * type but zero size. But there may be bugs that
>>>>>>>>> +		 * cause us to do that.
>>>>>>>>> +		 */
>>>>>>>>> +		if (!resource_size(res))
>>>>>>>>> +			goto no_mmap;
>>>>>>>>> +
>>>>>>>>> +		if (resource_size(res) >= PAGE_SIZE) {
>>>>>>>>> +			vdev->bar_mmap_supported[bar] = true;
>>>>>>>>> +			continue;
>>>>>>>>> +		}
>>>>>>>>> +
>>>>>>>>> +		if (!(res->start & ~PAGE_MASK)) {
>>>>>>>>> +			/*
>>>>>>>>> +			 * Add a dummy resource to reserve the remainder
>>>>>>>>> +			 * of the exclusive page in case that hot-add
>>>>>>>>> +			 * device's bar is assigned into it.
>>>>>>>>> +			 */
>>>>>>>>> +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
>>>>>>>>> +			if (dummy_res == NULL)
>>>>>>>>> +				goto no_mmap;
>>>>>>>>> +
>>>>>>>>> +			dummy_res->resource.start = res->end + 1;
>>>>>>>>> +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>>>>>>>>> +			dummy_res->resource.flags = res->flags;
>>>>>>>>> +			if (request_resource(res->parent,
>>>>>>>>> +						&dummy_res->resource)) {
>>>>>>>>> +				kfree(dummy_res);
>>>>>>>>> +				goto no_mmap;
>>>>>>>>> +			}
>>>>>>>> Isn't it true that request_resource() only tells us that at a given
>>>>>>>> point in time, no other drivers have reserved that resource?  It seems
>>>>>>>> like it does not guarantee that the resource isn't routed to another
>>>>>>>> device or that another driver won't at some point attempt to request
>>>>>>>> that same resource.  So for example if a user constructs their initrd
>>>>>>>> to bind vfio-pci to devices before other modules load, this
>>>>>>>> request_resource() may succeed, at the expense of drivers loaded later
>>>>>>>> now failing.  The behavior will depend on driver load order and we're
>>>>>>>> not actually insuring that the overflow resource is unused, just that
>>>>>>>> we got it first.  Can we do better?  Am I missing something that
>>>>>>>> prevents this?  Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>> Couldn't PCI resources allocator prevent this, which will find a
>>>>>>> empty slot in the resource tree firstly, then try to request that
>>>>>>> resource in allocate_resource() when a PCI device is probed.
>>>>>>> And I'd like to know why a PCI device driver would attempt to
>>>>>>> call request_resource()? Should this be done in PCI enumeration?
>>>>>> Hi Yongji,
>>>>>>
>>>>>> Looks like most pci drivers call pci_request_regions().  From there the
>>>>>> call path is:
>>>>>>
>>>>>> pci_request_selected_regions
>>>>>>      __pci_request_selected_regions
>>>>>>        __pci_request_region
>>>>>>          __request_mem_region
>>>>>>            __request_region
>>>>>>              __request_resource
>>>>>>
>>>>>> We see this driver ordering issue sometimes with users attempting to
>>>>>> blacklist native pci drivers, trying to leave a device free for use by
>>>>>> vfio-pci.  If the device is a graphics card, the generic vesa or uefi
>>>>>> driver can request device resources causing a failure when vfio-pci
>>>>>> tries to request those same resources.  I expect that unless it's a
>>>>>> boot device, like vga in my example, the resources are not enabled
>>>>>> until the driver opens the device, therefore the request_resource() call
>>>>>> doesn't occur until that point.
>>>>>>
>>>>>> For another trivial example, look at /proc/iomem as you load and unload
>>>>>> a driver, on my laptop with e1000e unloaded I see:
>>>>>>
>>>>>>      e1200000-e121ffff : 0000:00:19.0
>>>>>>      e123e000-e123efff : 0000:00:19.0
>>>>>>
>>>>>> When e1000e is loaded, each of these becomes claimed by the e1000e
>>>>>> driver:
>>>>>>
>>>>>>      e1200000-e121ffff : 0000:00:19.0
>>>>>>        e1200000-e121ffff : e1000e
>>>>>>      e123e000-e123efff : 0000:00:19.0
>>>>>>        e123e000-e123efff : e1000e
>>>>>>
>>>>>> Clearly pci core knows the resource is associated with the device, but
>>>>>> I don't think we're tapping into that with request_resource(), we're
>>>>>> just potentially stealing resources that another driver might have
>>>>>> claimed otherwise as I described above.  That's my suspicion at
>>>>>> least, feel free to show otherwise if it's incorrect.  Thanks,
>>>>>>
>>>>>> Alex
>>>>>>        
>>>>> Thanks for your explanation. But I still have one question.
>>>>> Shouldn't PCI core have claimed all PCI device's resources
>>>>> after probing those devices. If so, request_resource() will fail
>>>>> when vfio-pci try to steal resources that another driver might
>>>>> request later. Anything I missed here?  Some device resources
>>>>> would not be claimed in PCI core?
>>>> Hi Yongji,
>>>>
>>>> I don't know what to say, this is not how the interface currently
>>>> works.  request_resource() is a driver level interface that tries to
>>>> prevent drivers from claiming conflicting resources.  In this patch
>>>> you're trying to use it to probe whether a resource maps to another
>>>> device.  This is not what it does.  request_resource() will happily let
>>>> you claim any resource you want, so long as nobody else claimed it
>>>> first.  So the only case where the assumptions in this patch are valid
>>>> is if we can guarantee that any potentially conflicting device has a
>>>> driver loaded that has claimed those resources.  As it is here,
>>>> vfio-pci will happily attempt to request resources that might overlap
>>>> with another device and might break drivers that haven't yet had a
>>>> chance to probe their devices.  I don't think that's acceptable.
>>>> Thanks,
>>>>
>>>> Alex
>>>>     
>>> I'm trying to get your point. Let me give an example here.
>>> I'm not sure whether my understanding is right. Please
>>> point it out if I'm wrong.
>>>
>>> We assume that there are two PCI devices like this:
>>>
>>> 240000000000-24feffffffff : /pciex@3fffe40400000
>>>     240000000000-2400ffffffff : PCI Bus 0002:01
>>>       240000000000-240000007fff : 0002:01:00.0
>>>         240000000000-240000007fff : vfio-pci
>>>       240000008000-24000000ffff : 0002:01:01.0
>>>         240000008000-24000000ffff : lpfc
>>>
>>> Do you mean vfio-pci driver will succeed in requesting
>>> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K)
>>> if it is loaded before lpfc driver? Like this:
>>>
>>> 240000000000-24feffffffff : /pciex@3fffe40400000
>>>     240000000000-2400ffffffff : PCI Bus 0002:01
>>>       240000000000-240000007fff : 0002:01:00.0
>>>         240000000000-240000007fff : vfio-pci
>>>       240000008000-24000000ffff : 0002:01:01.0
>>>         240000008000-24000000ffff : <BAD>    --> vfio-pci call
>>> request_resource()
>>>
>>> Then lpfc driver will fail when it attempts to call
>>> pci_request_regions() later.
>> Yes, that is my supposition.
>>   
>>> But is it possible that the dummy_res become the child of
>>> the res: 0002:01:01.0? Wouldn't request_resource() fail when
>>> it found parent res: PCI Bus 0002:01 already have conflict
>>> child res: 0002:01:01.0.
>>>
>>> And I think the case that request_resource() will succeed
>>> should like this:
>>>
>>> 240000000000-24feffffffff : /pciex@3fffe40400000
>>>     240000000000-2400ffffffff : PCI Bus 0002:01
>>>       240000000000-240000007fff : 0002:01:00.0
>>>       240000010000-240000017fff : 0002:01:01.0
>>>
>>> There is a mem hole: [240000008000-24000000ffff] after
>>> PCI probing.  After loading drivers, the resources tree
>>> will be:
>>>
>>> 240000000000-24feffffffff : /pciex@3fffe40400000
>>>     240000000000-2400ffffffff : PCI Bus 0002:01
>>>       240000000000-240000007fff : 0002:01:00.0
>>>         240000000000-240000007fff : vfio-pci
>>>       240000008000-24000000ffff : <BAD>    ---> vfio-pci call
>>> request_resource()
>>>       240000010000-240000017fff : 0002:01:01.0
>>>         240000010000-240000017fff : lpfc
>> Ok, let's stop guessing about this.  I modified your patch as follows
>> so I could easily test it on a 4k host:
>>
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>>   	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>>   }
>>   
>> +#define VFIO_64K_PAGE_SIZE (64*1024)
>> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
>> +
>>   static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>   {
>>   	struct resource *res;
>> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>   		if (!resource_size(res))
>>   			goto no_mmap;
>>   
>> -		if (resource_size(res) >= PAGE_SIZE) {
>> +		if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
>>   			vdev->bar_mmap_supported[bar] = true;
>>   			continue;
>>   		}
>>   
>> -		if (!(res->start & ~PAGE_MASK)) {
>> +		if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
>> +			int ret;
>>   			/*
>>   			 * Add a dummy resource to reserve the remainder
>>   			 * of the exclusive page in case that hot-add
>> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>   				goto no_mmap;
>>   
>>   			dummy_res->resource.start = res->end + 1;
>> -			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>> +			dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
>>   			dummy_res->resource.flags = res->flags;
>> -			if (request_resource(res->parent,
>> -						&dummy_res->resource)) {
>> +			ret = request_resource(res->parent,
>> +						&dummy_res->resource);
>> +			if (ret) {
>> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
>>   				kfree(dummy_res);
>>   				goto no_mmap;
>>   			}
>>
>> IOW, enforce 64k for mmap regardless of PAGE_SIZE.  Then I find a
>> system configuration to test it:
>>
>>    ee400000-ef4fffff : PCI Bus 0000:07
>>      ef480000-ef49ffff : 0000:07:00.0
>>        ef480000-ef483fff : 0000:08:10.0
>>        ef484000-ef487fff : 0000:08:10.2
>>        ef488000-ef48bfff : 0000:08:10.4
>>        ef48c000-ef48ffff : 0000:08:10.6
>>        ef490000-ef493fff : 0000:08:11.0
>>        ef494000-ef497fff : 0000:08:11.2
>>        ef498000-ef49bfff : 0000:08:11.4
>>      ef4a0000-ef4bffff : 0000:07:00.0
>>        ef4a0000-ef4a3fff : 0000:08:10.0
>>        ef4a4000-ef4a7fff : 0000:08:10.2
>>        ef4a8000-ef4abfff : 0000:08:10.4
>>        ef4ac000-ef4affff : 0000:08:10.6
>>        ef4b0000-ef4b3fff : 0000:08:11.0
>>        ef4b4000-ef4b7fff : 0000:08:11.2
>>        ef4b8000-ef4bbfff : 0000:08:11.4
>>
>> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all
>> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a
>> separate contiguous range.  The igbvf driver is not loaded, so the
>> other resources are free to be claimed, what happens?
>>
>> It looks like you're right, the request_resource() fails with:
>>
>> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
>> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)
>>
>> So we get back -EBUSY which means we hit a conflict.  I would have
>> expected that this means our res->parent that we're using for
>> request_resource() is only, for instance, ef480000-ef483fff (ie. the
>> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of
>> the parent, but adding the parent resource range to the dev_info(), it
>> actually shows the range being ef480000-ef49ffff, so the parent is
>> actually the 07:00.0 resource.  In fact, we can't even use
>> request_resource() like this to claim the BAR itself, which is why we
>> use pci_request_selected_regions(), which allows conflicts, putting the
>> requested resource at the leaf of the tree.
>>
>> So I guess I retract this concern about the use of request_resource(),
>> it does seem to behave as intended.  Unless I can spot anything else or
>> other reviewers have comments, I'll queue this into my next branch for
>> v4.8.  Thanks,
>
> Ok, one more test, I found that I have access to the following USB
> devices:
>
> 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])
> 	Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K]
>
> 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI])
> 	Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K]
>
> These are nicely mapped such that vfio-pci can claim the residual space
> from the page, which results in the following in /proc/iomem:
>
>    f7a07000-f7a073ff : 0000:00:1d.0
>      f7a07000-f7a073ff : vfio
>    f7a07400-f7a07fff : <BAD>
>    f7a08000-f7a083ff : 0000:00:1a.0
>      f7a08000-f7a083ff : vfio
>    f7a08400-f7a08fff : <BAD>
>
> I should have looked more closely at your previous reply, I didn't
> think that "<BAD>" was literally the owner of these dummy resources.
> Please fix this to report something that isn't going frighten users
> and make small children cry.  Thanks,

Yeah, I also noticed that:-). Now I'm trying to find a proper
name. What do you think about "vfio-pci (dummy)"?

Thanks,
Yongji

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson June 30, 2016, 2:53 a.m. UTC | #3
On Thu, 30 Jun 2016 10:40:23 +0800
Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:

> Hi Alex,
> 
> On 2016/6/30 4:03, Alex Williamson wrote:
> 
> > On Tue, 28 Jun 2016 13:47:23 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >  
> >> On Tue, 28 Jun 2016 18:09:46 +0800
> >> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
> >>  
> >>> Hi, Alex
> >>>
> >>> On 2016/6/25 0:43, Alex Williamson wrote:
> >>>      
> >>>> On Fri, 24 Jun 2016 23:37:02 +0800
> >>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
> >>>>       
> >>>>> Hi, Alex
> >>>>>
> >>>>> On 2016/6/24 11:37, Alex Williamson wrote:
> >>>>>       
> >>>>>> On Fri, 24 Jun 2016 10:52:58 +0800
> >>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:  
> >>>>>>> On 2016/6/24 0:12, Alex Williamson wrote:  
> >>>>>>>> On Mon, 30 May 2016 21:06:37 +0800
> >>>>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:  
> >>>>>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >>>>>>>>> +{
> >>>>>>>>> +	struct resource *res;
> >>>>>>>>> +	int bar;
> >>>>>>>>> +	struct vfio_pci_dummy_resource *dummy_res;
> >>>>>>>>> +
> >>>>>>>>> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> >>>>>>>>> +
> >>>>>>>>> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> >>>>>>>>> +		res = vdev->pdev->resource + bar;
> >>>>>>>>> +
> >>>>>>>>> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> >>>>>>>>> +			goto no_mmap;
> >>>>>>>>> +
> >>>>>>>>> +		if (!(res->flags & IORESOURCE_MEM))
> >>>>>>>>> +			goto no_mmap;
> >>>>>>>>> +
> >>>>>>>>> +		/*
> >>>>>>>>> +		 * The PCI core shouldn't set up a resource with a
> >>>>>>>>> +		 * type but zero size. But there may be bugs that
> >>>>>>>>> +		 * cause us to do that.
> >>>>>>>>> +		 */
> >>>>>>>>> +		if (!resource_size(res))
> >>>>>>>>> +			goto no_mmap;
> >>>>>>>>> +
> >>>>>>>>> +		if (resource_size(res) >= PAGE_SIZE) {
> >>>>>>>>> +			vdev->bar_mmap_supported[bar] = true;
> >>>>>>>>> +			continue;
> >>>>>>>>> +		}
> >>>>>>>>> +
> >>>>>>>>> +		if (!(res->start & ~PAGE_MASK)) {
> >>>>>>>>> +			/*
> >>>>>>>>> +			 * Add a dummy resource to reserve the remainder
> >>>>>>>>> +			 * of the exclusive page in case that hot-add
> >>>>>>>>> +			 * device's bar is assigned into it.
> >>>>>>>>> +			 */
> >>>>>>>>> +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> >>>>>>>>> +			if (dummy_res == NULL)
> >>>>>>>>> +				goto no_mmap;
> >>>>>>>>> +
> >>>>>>>>> +			dummy_res->resource.start = res->end + 1;
> >>>>>>>>> +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> >>>>>>>>> +			dummy_res->resource.flags = res->flags;
> >>>>>>>>> +			if (request_resource(res->parent,
> >>>>>>>>> +						&dummy_res->resource)) {
> >>>>>>>>> +				kfree(dummy_res);
> >>>>>>>>> +				goto no_mmap;
> >>>>>>>>> +			}  
> >>>>>>>> Isn't it true that request_resource() only tells us that at a given
> >>>>>>>> point in time, no other drivers have reserved that resource?  It seems
> >>>>>>>> like it does not guarantee that the resource isn't routed to another
> >>>>>>>> device or that another driver won't at some point attempt to request
> >>>>>>>> that same resource.  So for example if a user constructs their initrd
> >>>>>>>> to bind vfio-pci to devices before other modules load, this
> >>>>>>>> request_resource() may succeed, at the expense of drivers loaded later
> >>>>>>>> now failing.  The behavior will depend on driver load order and we're
> >>>>>>>> not actually insuring that the overflow resource is unused, just that
> >>>>>>>> we got it first.  Can we do better?  Am I missing something that
> >>>>>>>> prevents this?  Thanks,
> >>>>>>>>
> >>>>>>>> Alex  
> >>>>>>> Couldn't PCI resources allocator prevent this, which will find a
> >>>>>>> empty slot in the resource tree firstly, then try to request that
> >>>>>>> resource in allocate_resource() when a PCI device is probed.
> >>>>>>> And I'd like to know why a PCI device driver would attempt to
> >>>>>>> call request_resource()? Should this be done in PCI enumeration?  
> >>>>>> Hi Yongji,
> >>>>>>
> >>>>>> Looks like most pci drivers call pci_request_regions().  From there the
> >>>>>> call path is:
> >>>>>>
> >>>>>> pci_request_selected_regions
> >>>>>>      __pci_request_selected_regions
> >>>>>>        __pci_request_region
> >>>>>>          __request_mem_region
> >>>>>>            __request_region
> >>>>>>              __request_resource
> >>>>>>
> >>>>>> We see this driver ordering issue sometimes with users attempting to
> >>>>>> blacklist native pci drivers, trying to leave a device free for use by
> >>>>>> vfio-pci.  If the device is a graphics card, the generic vesa or uefi
> >>>>>> driver can request device resources causing a failure when vfio-pci
> >>>>>> tries to request those same resources.  I expect that unless it's a
> >>>>>> boot device, like vga in my example, the resources are not enabled
> >>>>>> until the driver opens the device, therefore the request_resource() call
> >>>>>> doesn't occur until that point.
> >>>>>>
> >>>>>> For another trivial example, look at /proc/iomem as you load and unload
> >>>>>> a driver, on my laptop with e1000e unloaded I see:
> >>>>>>
> >>>>>>      e1200000-e121ffff : 0000:00:19.0
> >>>>>>      e123e000-e123efff : 0000:00:19.0
> >>>>>>
> >>>>>> When e1000e is loaded, each of these becomes claimed by the e1000e
> >>>>>> driver:
> >>>>>>
> >>>>>>      e1200000-e121ffff : 0000:00:19.0
> >>>>>>        e1200000-e121ffff : e1000e
> >>>>>>      e123e000-e123efff : 0000:00:19.0
> >>>>>>        e123e000-e123efff : e1000e
> >>>>>>
> >>>>>> Clearly pci core knows the resource is associated with the device, but
> >>>>>> I don't think we're tapping into that with request_resource(), we're
> >>>>>> just potentially stealing resources that another driver might have
> >>>>>> claimed otherwise as I described above.  That's my suspicion at
> >>>>>> least, feel free to show otherwise if it's incorrect.  Thanks,
> >>>>>>
> >>>>>> Alex
> >>>>>>          
> >>>>> Thanks for your explanation. But I still have one question.
> >>>>> Shouldn't PCI core have claimed all PCI device's resources
> >>>>> after probing those devices. If so, request_resource() will fail
> >>>>> when vfio-pci try to steal resources that another driver might
> >>>>> request later. Anything I missed here?  Some device resources
> >>>>> would not be claimed in PCI core?  
> >>>> Hi Yongji,
> >>>>
> >>>> I don't know what to say, this is not how the interface currently
> >>>> works.  request_resource() is a driver level interface that tries to
> >>>> prevent drivers from claiming conflicting resources.  In this patch
> >>>> you're trying to use it to probe whether a resource maps to another
> >>>> device.  This is not what it does.  request_resource() will happily let
> >>>> you claim any resource you want, so long as nobody else claimed it
> >>>> first.  So the only case where the assumptions in this patch are valid
> >>>> is if we can guarantee that any potentially conflicting device has a
> >>>> driver loaded that has claimed those resources.  As it is here,
> >>>> vfio-pci will happily attempt to request resources that might overlap
> >>>> with another device and might break drivers that haven't yet had a
> >>>> chance to probe their devices.  I don't think that's acceptable.
> >>>> Thanks,
> >>>>
> >>>> Alex
> >>>>       
> >>> I'm trying to get your point. Let me give an example here.
> >>> I'm not sure whether my understanding is right. Please
> >>> point it out if I'm wrong.
> >>>
> >>> We assume that there are two PCI devices like this:
> >>>
> >>> 240000000000-24feffffffff : /pciex@3fffe40400000
> >>>     240000000000-2400ffffffff : PCI Bus 0002:01
> >>>       240000000000-240000007fff : 0002:01:00.0
> >>>         240000000000-240000007fff : vfio-pci
> >>>       240000008000-24000000ffff : 0002:01:01.0
> >>>         240000008000-24000000ffff : lpfc
> >>>
> >>> Do you mean vfio-pci driver will succeed in requesting
> >>> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K)
> >>> if it is loaded before lpfc driver? Like this:
> >>>
> >>> 240000000000-24feffffffff : /pciex@3fffe40400000
> >>>     240000000000-2400ffffffff : PCI Bus 0002:01
> >>>       240000000000-240000007fff : 0002:01:00.0
> >>>         240000000000-240000007fff : vfio-pci
> >>>       240000008000-24000000ffff : 0002:01:01.0
> >>>         240000008000-24000000ffff : <BAD>    --> vfio-pci call
> >>> request_resource()
> >>>
> >>> Then lpfc driver will fail when it attempts to call
> >>> pci_request_regions() later.  
> >> Yes, that is my supposition.
> >>     
> >>> But is it possible that the dummy_res become the child of
> >>> the res: 0002:01:01.0? Wouldn't request_resource() fail when
> >>> it found parent res: PCI Bus 0002:01 already have conflict
> >>> child res: 0002:01:01.0.
> >>>
> >>> And I think the case that request_resource() will succeed
> >>> should like this:
> >>>
> >>> 240000000000-24feffffffff : /pciex@3fffe40400000
> >>>     240000000000-2400ffffffff : PCI Bus 0002:01
> >>>       240000000000-240000007fff : 0002:01:00.0
> >>>       240000010000-240000017fff : 0002:01:01.0
> >>>
> >>> There is a mem hole: [240000008000-24000000ffff] after
> >>> PCI probing.  After loading drivers, the resources tree
> >>> will be:
> >>>
> >>> 240000000000-24feffffffff : /pciex@3fffe40400000
> >>>     240000000000-2400ffffffff : PCI Bus 0002:01
> >>>       240000000000-240000007fff : 0002:01:00.0
> >>>         240000000000-240000007fff : vfio-pci
> >>>       240000008000-24000000ffff : <BAD>    ---> vfio-pci call
> >>> request_resource()
> >>>       240000010000-240000017fff : 0002:01:01.0
> >>>         240000010000-240000017fff : lpfc  
> >> Ok, let's stop guessing about this.  I modified your patch as follows
> >> so I could easily test it on a 4k host:
> >>
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> >>   	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> >>   }
> >>   
> >> +#define VFIO_64K_PAGE_SIZE (64*1024)
> >> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
> >> +
> >>   static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >>   {
> >>   	struct resource *res;
> >> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >>   		if (!resource_size(res))
> >>   			goto no_mmap;
> >>   
> >> -		if (resource_size(res) >= PAGE_SIZE) {
> >> +		if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
> >>   			vdev->bar_mmap_supported[bar] = true;
> >>   			continue;
> >>   		}
> >>   
> >> -		if (!(res->start & ~PAGE_MASK)) {
> >> +		if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
> >> +			int ret;
> >>   			/*
> >>   			 * Add a dummy resource to reserve the remainder
> >>   			 * of the exclusive page in case that hot-add
> >> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >>   				goto no_mmap;
> >>   
> >>   			dummy_res->resource.start = res->end + 1;
> >> -			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> >> +			dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
> >>   			dummy_res->resource.flags = res->flags;
> >> -			if (request_resource(res->parent,
> >> -						&dummy_res->resource)) {
> >> +			ret = request_resource(res->parent,
> >> +						&dummy_res->resource);
> >> +			if (ret) {
> >> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
> >>   				kfree(dummy_res);
> >>   				goto no_mmap;
> >>   			}
> >>
> >> IOW, enforce 64k for mmap regardless of PAGE_SIZE.  Then I find a
> >> system configuration to test it:
> >>
> >>    ee400000-ef4fffff : PCI Bus 0000:07
> >>      ef480000-ef49ffff : 0000:07:00.0
> >>        ef480000-ef483fff : 0000:08:10.0
> >>        ef484000-ef487fff : 0000:08:10.2
> >>        ef488000-ef48bfff : 0000:08:10.4
> >>        ef48c000-ef48ffff : 0000:08:10.6
> >>        ef490000-ef493fff : 0000:08:11.0
> >>        ef494000-ef497fff : 0000:08:11.2
> >>        ef498000-ef49bfff : 0000:08:11.4
> >>      ef4a0000-ef4bffff : 0000:07:00.0
> >>        ef4a0000-ef4a3fff : 0000:08:10.0
> >>        ef4a4000-ef4a7fff : 0000:08:10.2
> >>        ef4a8000-ef4abfff : 0000:08:10.4
> >>        ef4ac000-ef4affff : 0000:08:10.6
> >>        ef4b0000-ef4b3fff : 0000:08:11.0
> >>        ef4b4000-ef4b7fff : 0000:08:11.2
> >>        ef4b8000-ef4bbfff : 0000:08:11.4
> >>
> >> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all
> >> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a
> >> separate contiguous range.  The igbvf driver is not loaded, so the
> >> other resources are free to be claimed, what happens?
> >>
> >> It looks like you're right, the request_resource() fails with:
> >>
> >> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
> >> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)
> >>
> >> So we get back -EBUSY which means we hit a conflict.  I would have
> >> expected that this means our res->parent that we're using for
> >> request_resource() is only, for instance, ef480000-ef483fff (ie. the
> >> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of
> >> the parent, but adding the parent resource range to the dev_info(), it
> >> actually shows the range being ef480000-ef49ffff, so the parent is
> >> actually the 07:00.0 resource.  In fact, we can't even use
> >> request_resource() like this to claim the BAR itself, which is why we
> >> use pci_request_selected_regions(), which allows conflicts, putting the
> >> requested resource at the leaf of the tree.
> >>
> >> So I guess I retract this concern about the use of request_resource(),
> >> it does seem to behave as intended.  Unless I can spot anything else or
> >> other reviewers have comments, I'll queue this into my next branch for
> >> v4.8.  Thanks,  
> >
> > Ok, one more test, I found that I have access to the following USB
> > devices:
> >
> > 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])
> > 	Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K]
> >
> > 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI])
> > 	Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K]
> >
> > These are nicely mapped such that vfio-pci can claim the residual space
> > from the page, which results in the following in /proc/iomem:
> >
> >    f7a07000-f7a073ff : 0000:00:1d.0
> >      f7a07000-f7a073ff : vfio
> >    f7a07400-f7a07fff : <BAD>
> >    f7a08000-f7a083ff : 0000:00:1a.0
> >      f7a08000-f7a083ff : vfio
> >    f7a08400-f7a08fff : <BAD>
> >
> > I should have looked more closely at your previous reply, I didn't
> > think that "<BAD>" was literally the owner of these dummy resources.
> > Please fix this to report something that isn't going frighten users
> > and make small children cry.  Thanks,  
> 
> Yeah, I also noticed that:-). Now I'm trying to find a proper
> name. What do you think about "vfio-pci (dummy)"?

How about "vfio sub-page reserved"?  Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yongji Xie June 30, 2016, 3:29 a.m. UTC | #4
On 2016/6/30 10:53, Alex Williamson wrote:

> On Thu, 30 Jun 2016 10:40:23 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>
>> Hi Alex,
>>
>> On 2016/6/30 4:03, Alex Williamson wrote:
>>
>>> On Tue, 28 Jun 2016 13:47:23 -0600
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>   
>>>> On Tue, 28 Jun 2016 18:09:46 +0800
>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>>>   
>>>>> Hi, Alex
>>>>>
>>>>> On 2016/6/25 0:43, Alex Williamson wrote:
>>>>>       
>>>>>> On Fri, 24 Jun 2016 23:37:02 +0800
>>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>>>>>        
>>>>>>> Hi, Alex
>>>>>>>
>>>>>>> On 2016/6/24 11:37, Alex Williamson wrote:
>>>>>>>        
>>>>>>>> On Fri, 24 Jun 2016 10:52:58 +0800
>>>>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>>>>>>>> On 2016/6/24 0:12, Alex Williamson wrote:
>>>>>>>>>> On Mon, 30 May 2016 21:06:37 +0800
>>>>>>>>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>>>>>>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>>>>>>>>> +{
>>>>>>>>>>> +	struct resource *res;
>>>>>>>>>>> +	int bar;
>>>>>>>>>>> +	struct vfio_pci_dummy_resource *dummy_res;
>>>>>>>>>>> +
>>>>>>>>>>> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
>>>>>>>>>>> +
>>>>>>>>>>> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
>>>>>>>>>>> +		res = vdev->pdev->resource + bar;
>>>>>>>>>>> +
>>>>>>>>>>> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
>>>>>>>>>>> +			goto no_mmap;
>>>>>>>>>>> +
>>>>>>>>>>> +		if (!(res->flags & IORESOURCE_MEM))
>>>>>>>>>>> +			goto no_mmap;
>>>>>>>>>>> +
>>>>>>>>>>> +		/*
>>>>>>>>>>> +		 * The PCI core shouldn't set up a resource with a
>>>>>>>>>>> +		 * type but zero size. But there may be bugs that
>>>>>>>>>>> +		 * cause us to do that.
>>>>>>>>>>> +		 */
>>>>>>>>>>> +		if (!resource_size(res))
>>>>>>>>>>> +			goto no_mmap;
>>>>>>>>>>> +
>>>>>>>>>>> +		if (resource_size(res) >= PAGE_SIZE) {
>>>>>>>>>>> +			vdev->bar_mmap_supported[bar] = true;
>>>>>>>>>>> +			continue;
>>>>>>>>>>> +		}
>>>>>>>>>>> +
>>>>>>>>>>> +		if (!(res->start & ~PAGE_MASK)) {
>>>>>>>>>>> +			/*
>>>>>>>>>>> +			 * Add a dummy resource to reserve the remainder
>>>>>>>>>>> +			 * of the exclusive page in case that hot-add
>>>>>>>>>>> +			 * device's bar is assigned into it.
>>>>>>>>>>> +			 */
>>>>>>>>>>> +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
>>>>>>>>>>> +			if (dummy_res == NULL)
>>>>>>>>>>> +				goto no_mmap;
>>>>>>>>>>> +
>>>>>>>>>>> +			dummy_res->resource.start = res->end + 1;
>>>>>>>>>>> +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>>>>>>>>>>> +			dummy_res->resource.flags = res->flags;
>>>>>>>>>>> +			if (request_resource(res->parent,
>>>>>>>>>>> +						&dummy_res->resource)) {
>>>>>>>>>>> +				kfree(dummy_res);
>>>>>>>>>>> +				goto no_mmap;
>>>>>>>>>>> +			}
>>>>>>>>>> Isn't it true that request_resource() only tells us that at a given
>>>>>>>>>> point in time, no other drivers have reserved that resource?  It seems
>>>>>>>>>> like it does not guarantee that the resource isn't routed to another
>>>>>>>>>> device or that another driver won't at some point attempt to request
>>>>>>>>>> that same resource.  So for example if a user constructs their initrd
>>>>>>>>>> to bind vfio-pci to devices before other modules load, this
>>>>>>>>>> request_resource() may succeed, at the expense of drivers loaded later
>>>>>>>>>> now failing.  The behavior will depend on driver load order and we're
>>>>>>>>>> not actually insuring that the overflow resource is unused, just that
>>>>>>>>>> we got it first.  Can we do better?  Am I missing something that
>>>>>>>>>> prevents this?  Thanks,
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>> Couldn't PCI resources allocator prevent this, which will find a
>>>>>>>>> empty slot in the resource tree firstly, then try to request that
>>>>>>>>> resource in allocate_resource() when a PCI device is probed.
>>>>>>>>> And I'd like to know why a PCI device driver would attempt to
>>>>>>>>> call request_resource()? Should this be done in PCI enumeration?
>>>>>>>> Hi Yongji,
>>>>>>>>
>>>>>>>> Looks like most pci drivers call pci_request_regions().  From there the
>>>>>>>> call path is:
>>>>>>>>
>>>>>>>> pci_request_selected_regions
>>>>>>>>       __pci_request_selected_regions
>>>>>>>>         __pci_request_region
>>>>>>>>           __request_mem_region
>>>>>>>>             __request_region
>>>>>>>>               __request_resource
>>>>>>>>
>>>>>>>> We see this driver ordering issue sometimes with users attempting to
>>>>>>>> blacklist native pci drivers, trying to leave a device free for use by
>>>>>>>> vfio-pci.  If the device is a graphics card, the generic vesa or uefi
>>>>>>>> driver can request device resources causing a failure when vfio-pci
>>>>>>>> tries to request those same resources.  I expect that unless it's a
>>>>>>>> boot device, like vga in my example, the resources are not enabled
>>>>>>>> until the driver opens the device, therefore the request_resource() call
>>>>>>>> doesn't occur until that point.
>>>>>>>>
>>>>>>>> For another trivial example, look at /proc/iomem as you load and unload
>>>>>>>> a driver, on my laptop with e1000e unloaded I see:
>>>>>>>>
>>>>>>>>       e1200000-e121ffff : 0000:00:19.0
>>>>>>>>       e123e000-e123efff : 0000:00:19.0
>>>>>>>>
>>>>>>>> When e1000e is loaded, each of these becomes claimed by the e1000e
>>>>>>>> driver:
>>>>>>>>
>>>>>>>>       e1200000-e121ffff : 0000:00:19.0
>>>>>>>>         e1200000-e121ffff : e1000e
>>>>>>>>       e123e000-e123efff : 0000:00:19.0
>>>>>>>>         e123e000-e123efff : e1000e
>>>>>>>>
>>>>>>>> Clearly pci core knows the resource is associated with the device, but
>>>>>>>> I don't think we're tapping into that with request_resource(), we're
>>>>>>>> just potentially stealing resources that another driver might have
>>>>>>>> claimed otherwise as I described above.  That's my suspicion at
>>>>>>>> least, feel free to show otherwise if it's incorrect.  Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>           
>>>>>>> Thanks for your explanation. But I still have one question.
>>>>>>> Shouldn't PCI core have claimed all PCI device's resources
>>>>>>> after probing those devices. If so, request_resource() will fail
>>>>>>> when vfio-pci try to steal resources that another driver might
>>>>>>> request later. Anything I missed here?  Some device resources
>>>>>>> would not be claimed in PCI core?
>>>>>> Hi Yongji,
>>>>>>
>>>>>> I don't know what to say, this is not how the interface currently
>>>>>> works.  request_resource() is a driver level interface that tries to
>>>>>> prevent drivers from claiming conflicting resources.  In this patch
>>>>>> you're trying to use it to probe whether a resource maps to another
>>>>>> device.  This is not what it does.  request_resource() will happily let
>>>>>> you claim any resource you want, so long as nobody else claimed it
>>>>>> first.  So the only case where the assumptions in this patch are valid
>>>>>> is if we can guarantee that any potentially conflicting device has a
>>>>>> driver loaded that has claimed those resources.  As it is here,
>>>>>> vfio-pci will happily attempt to request resources that might overlap
>>>>>> with another device and might break drivers that haven't yet had a
>>>>>> chance to probe their devices.  I don't think that's acceptable.
>>>>>> Thanks,
>>>>>>
>>>>>> Alex
>>>>>>        
>>>>> I'm trying to get your point. Let me give an example here.
>>>>> I'm not sure whether my understanding is right. Please
>>>>> point it out if I'm wrong.
>>>>>
>>>>> We assume that there are two PCI devices like this:
>>>>>
>>>>> 240000000000-24feffffffff : /pciex@3fffe40400000
>>>>>      240000000000-2400ffffffff : PCI Bus 0002:01
>>>>>        240000000000-240000007fff : 0002:01:00.0
>>>>>          240000000000-240000007fff : vfio-pci
>>>>>        240000008000-24000000ffff : 0002:01:01.0
>>>>>          240000008000-24000000ffff : lpfc
>>>>>
>>>>> Do you mean vfio-pci driver will succeed in requesting
>>>>> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K)
>>>>> if it is loaded before lpfc driver? Like this:
>>>>>
>>>>> 240000000000-24feffffffff : /pciex@3fffe40400000
>>>>>      240000000000-2400ffffffff : PCI Bus 0002:01
>>>>>        240000000000-240000007fff : 0002:01:00.0
>>>>>          240000000000-240000007fff : vfio-pci
>>>>>        240000008000-24000000ffff : 0002:01:01.0
>>>>>          240000008000-24000000ffff : <BAD>    --> vfio-pci call
>>>>> request_resource()
>>>>>
>>>>> Then lpfc driver will fail when it attempts to call
>>>>> pci_request_regions() later.
>>>> Yes, that is my supposition.
>>>>      
>>>>> But is it possible that the dummy_res become the child of
>>>>> the res: 0002:01:01.0? Wouldn't request_resource() fail when
>>>>> it found parent res: PCI Bus 0002:01 already have conflict
>>>>> child res: 0002:01:01.0.
>>>>>
>>>>> And I think the case that request_resource() will succeed
>>>>> should like this:
>>>>>
>>>>> 240000000000-24feffffffff : /pciex@3fffe40400000
>>>>>      240000000000-2400ffffffff : PCI Bus 0002:01
>>>>>        240000000000-240000007fff : 0002:01:00.0
>>>>>        240000010000-240000017fff : 0002:01:01.0
>>>>>
>>>>> There is a mem hole: [240000008000-24000000ffff] after
>>>>> PCI probing.  After loading drivers, the resources tree
>>>>> will be:
>>>>>
>>>>> 240000000000-24feffffffff : /pciex@3fffe40400000
>>>>>      240000000000-2400ffffffff : PCI Bus 0002:01
>>>>>        240000000000-240000007fff : 0002:01:00.0
>>>>>          240000000000-240000007fff : vfio-pci
>>>>>        240000008000-24000000ffff : <BAD>    ---> vfio-pci call
>>>>> request_resource()
>>>>>        240000010000-240000017fff : 0002:01:01.0
>>>>>          240000010000-240000017fff : lpfc
>>>> Ok, let's stop guessing about this.  I modified your patch as follows
>>>> so I could easily test it on a 4k host:
>>>>
>>>> --- a/drivers/vfio/pci/vfio_pci.c
>>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>>> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>>>>    	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>>>>    }
>>>>    
>>>> +#define VFIO_64K_PAGE_SIZE (64*1024)
>>>> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
>>>> +
>>>>    static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>>    {
>>>>    	struct resource *res;
>>>> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>>    		if (!resource_size(res))
>>>>    			goto no_mmap;
>>>>    
>>>> -		if (resource_size(res) >= PAGE_SIZE) {
>>>> +		if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
>>>>    			vdev->bar_mmap_supported[bar] = true;
>>>>    			continue;
>>>>    		}
>>>>    
>>>> -		if (!(res->start & ~PAGE_MASK)) {
>>>> +		if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
>>>> +			int ret;
>>>>    			/*
>>>>    			 * Add a dummy resource to reserve the remainder
>>>>    			 * of the exclusive page in case that hot-add
>>>> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>>    				goto no_mmap;
>>>>    
>>>>    			dummy_res->resource.start = res->end + 1;
>>>> -			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>>>> +			dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
>>>>    			dummy_res->resource.flags = res->flags;
>>>> -			if (request_resource(res->parent,
>>>> -						&dummy_res->resource)) {
>>>> +			ret = request_resource(res->parent,
>>>> +						&dummy_res->resource);
>>>> +			if (ret) {
>>>> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
>>>>    				kfree(dummy_res);
>>>>    				goto no_mmap;
>>>>    			}
>>>>
>>>> IOW, enforce 64k for mmap regardless of PAGE_SIZE.  Then I find a
>>>> system configuration to test it:
>>>>
>>>>     ee400000-ef4fffff : PCI Bus 0000:07
>>>>       ef480000-ef49ffff : 0000:07:00.0
>>>>         ef480000-ef483fff : 0000:08:10.0
>>>>         ef484000-ef487fff : 0000:08:10.2
>>>>         ef488000-ef48bfff : 0000:08:10.4
>>>>         ef48c000-ef48ffff : 0000:08:10.6
>>>>         ef490000-ef493fff : 0000:08:11.0
>>>>         ef494000-ef497fff : 0000:08:11.2
>>>>         ef498000-ef49bfff : 0000:08:11.4
>>>>       ef4a0000-ef4bffff : 0000:07:00.0
>>>>         ef4a0000-ef4a3fff : 0000:08:10.0
>>>>         ef4a4000-ef4a7fff : 0000:08:10.2
>>>>         ef4a8000-ef4abfff : 0000:08:10.4
>>>>         ef4ac000-ef4affff : 0000:08:10.6
>>>>         ef4b0000-ef4b3fff : 0000:08:11.0
>>>>         ef4b4000-ef4b7fff : 0000:08:11.2
>>>>         ef4b8000-ef4bbfff : 0000:08:11.4
>>>>
>>>> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all
>>>> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a
>>>> separate contiguous range.  The igbvf driver is not loaded, so the
>>>> other resources are free to be claimed, what happens?
>>>>
>>>> It looks like you're right, the request_resource() fails with:
>>>>
>>>> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
>>>> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)
>>>>
>>>> So we get back -EBUSY which means we hit a conflict.  I would have
>>>> expected that this means our res->parent that we're using for
>>>> request_resource() is only, for instance, ef480000-ef483fff (ie. the
>>>> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of
>>>> the parent, but adding the parent resource range to the dev_info(), it
>>>> actually shows the range being ef480000-ef49ffff, so the parent is
>>>> actually the 07:00.0 resource.  In fact, we can't even use
>>>> request_resource() like this to claim the BAR itself, which is why we
>>>> use pci_request_selected_regions(), which allows conflicts, putting the
>>>> requested resource at the leaf of the tree.
>>>>
>>>> So I guess I retract this concern about the use of request_resource(),
>>>> it does seem to behave as intended.  Unless I can spot anything else or
>>>> other reviewers have comments, I'll queue this into my next branch for
>>>> v4.8.  Thanks,
>>> Ok, one more test, I found that I have access to the following USB
>>> devices:
>>>
>>> 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])
>>> 	Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K]
>>>
>>> 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI])
>>> 	Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K]
>>>
>>> These are nicely mapped such that vfio-pci can claim the residual space
>>> from the page, which results in the following in /proc/iomem:
>>>
>>>     f7a07000-f7a073ff : 0000:00:1d.0
>>>       f7a07000-f7a073ff : vfio
>>>     f7a07400-f7a07fff : <BAD>
>>>     f7a08000-f7a083ff : 0000:00:1a.0
>>>       f7a08000-f7a083ff : vfio
>>>     f7a08400-f7a08fff : <BAD>
>>>
>>> I should have looked more closely at your previous reply, I didn't
>>> think that "<BAD>" was literally the owner of these dummy resources.
>>> Please fix this to report something that isn't going frighten users
>>> and make small children cry.  Thanks,
>> Yeah, I also noticed that:-). Now I'm trying to find a proper
>> name. What do you think about "vfio-pci (dummy)"?
> How about "vfio sub-page reserved"?  Thanks,

Sounds good. I'll send a new version soon.

Thanks,
Yongji

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -110,6 +110,9 @@  static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+#define VFIO_64K_PAGE_SIZE (64*1024)
+#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
+
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
 	struct resource *res;
@@ -135,12 +138,13 @@  static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 		if (!resource_size(res))
 			goto no_mmap;
 
-		if (resource_size(res) >= PAGE_SIZE) {
+		if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
 			vdev->bar_mmap_supported[bar] = true;
 			continue;
 		}
 
-		if (!(res->start & ~PAGE_MASK)) {
+		if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
+			int ret;
 			/*
 			 * Add a dummy resource to reserve the remainder
 			 * of the exclusive page in case that hot-add
@@ -151,10 +155,12 @@  static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 				goto no_mmap;
 
 			dummy_res->resource.start = res->end + 1;
-			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+			dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
 			dummy_res->resource.flags = res->flags;
-			if (request_resource(res->parent,
-						&dummy_res->resource)) {
+			ret = request_resource(res->parent,
+						&dummy_res->resource);
+			if (ret) {
+dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
 				kfree(dummy_res);
 				goto no_mmap;
 			}