Message ID | 20151002212137.GB30448@deltatee.com (mailing list archive)
---|---
State | Superseded
On Fri, Oct 2, 2015 at 2:21 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hi Dan,
>
> We've been doing some experimenting and testing with this patchset.
> Specifically, we are trying to use your ZONE_DEVICE work to enable
> peer-to-peer PCIe transfers. This is actually working pretty well
> (though we're still testing and working through some things).

Hmm, I didn't have peer-to-peer PCIe in mind for this mechanism, but
the test report is welcome nonetheless. The definition of dma_addr_t
is the device view of host memory, not necessarily the device view of
a peer device's memory range, so I expect you'll run into issues with
IOMMUs and other parts of the kernel that assume this definition.

> However, we've found a couple of issues:
>
> On Wed, Sep 23, 2015 at 12:42:27AM -0400, Dan Williams wrote:
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 3d6baa7d4534..20097e7b679a 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -49,12 +49,16 @@ struct page {
>>  				 * updated asynchronously */
>>  	union {
>>  		struct address_space *mapping;	/* If low bit clear, points to
>> -						 * inode address_space, or NULL.
>> +						 * inode address_space, unless
>> +						 * the page is in ZONE_DEVICE
>> +						 * then it points to its parent
>> +						 * dev_pagemap, otherwise NULL.
>>  						 * If page mapped as anonymous
>>  						 * memory, low bit is set, and
>>  						 * it points to anon_vma object:
>>  						 * see PAGE_MAPPING_ANON below.
>>  						 */
>> +		struct dev_pagemap *pgmap;
>>  		void *s_mem;	/* slab first object */
>>  	};
>
> When you add to this union and override the mapping value, we see bugs
> in calls to set_page_dirty when it tries to dereference mapping. I
> believe a change to page_mapping is required, such as the patch at the
> end of this email.

Yes, this location for dev_pagemap will not work. I've since moved it
to a union with the lru list_head since ZONE_DEVICE memory should
always have an elevated page count and never land on a slab allocator
lru.
>> diff --git a/mm/gup.c b/mm/gup.c
>> index a798293fc648..1064e9a489a4 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -98,7 +98,16 @@ retry:
>>  	}
>>
>>  	page = vm_normal_page(vma, address, pte);
>> -	if (unlikely(!page)) {
>> +	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
>> +		/*
>> +		 * Only return device mapping pages in the FOLL_GET case since
>> +		 * they are only valid while holding the pgmap reference.
>> +		 */
>> +		if (get_dev_pagemap(pte_pfn(pte), NULL))
>> +			page = pte_page(pte);
>> +		else
>> +			goto no_page;
>> +	} else if (unlikely(!page)) {
>
> I've found that if a driver creates a ZONE_DEVICE mapping but doesn't
> create the pagemap (using devm_register_pagemap) then the
> get_user_pages code will go into an infinite loop. I'm not sure
> whether this is an issue, but it seems undesirable for a buggy driver
> to be able to cause it.
>
> My thoughts are that either devm_register_pagemap needs to be done by
> devm_memremap_pages so a driver cannot use one without the other, or
> the GUP code needs to return -EFAULT if no pagemap was registered so
> it doesn't loop forever.

Exactly, we should fail (-EFAULT) get_user_pages() in that case since
we don't have a mechanism to pin down the mapping. I'll track down
what's causing the loop.
Hi Dan,

Good to know you've already addressed the struct page issue. We'll
watch for an updated patchset to try.

On 02/10/15 03:53 PM, Dan Williams wrote:
> Hmm, I didn't have peer-to-peer PCIe in mind for this mechanism, but
> the test report is welcome nonetheless. The definition of dma_addr_t
> is the device view of host memory, not necessarily the device view of
> a peer device's memory range, so I expect you'll run into issues with
> IOMMUs and other parts of the kernel that assume this definition.

Yeah, we've actually been doing this with a number of more "hacky"
techniques for some time. ZONE_DEVICE just provides us with a much
cleaner way to set this up that doesn't require patching around
get_user_pages in various places in the kernel.

We've never had any issues with the IOMMU getting in the way (at least
on Intel x86). My understanding has always been that the IOMMU sits
between a PCI card and main memory; it doesn't get in the way of
peer-to-peer transfers. Admittedly, though, I don't have a complete
understanding of how the IOMMU works in the kernel; I'm just speaking
from experimental experience. We've never tried this on other
architectures.

Thanks,

Logan
On 02/10/15 03:53 PM, Dan Williams wrote:
> Yes, this location for dev_pagemap will not work. I've since moved it
> to a union with the lru list_head since ZONE_DEVICE memory should
> always have an elevated page count and never land on a slab allocator
> lru.

Oh, also, I was actually hoping to make use of the lru list_head in
the future with ZONE_DEVICE memory. One thought I had was that once we
have a PCIe device with a BAR space, we'd need a way of allocating
these buffers when user space needs them. The simple approach I was
considering was to use the lru list head to store lists of used and
unused pages -- though there are probably other solutions that don't
require using struct pages.

Thanks,

Logan
On Fri, Oct 2, 2015 at 3:42 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> On 02/10/15 03:53 PM, Dan Williams wrote:
>> Yes, this location for dev_pagemap will not work. I've since moved it
>> to a union with the lru list_head since ZONE_DEVICE memory should
>> always have an elevated page count and never land on a slab allocator
>> lru.
>
> Oh, also, I was actually hoping to make use of the lru list_head in
> the future with ZONE_DEVICE memory. One thought I had was that once
> we have a PCIe device with a BAR space, we'd need a way of allocating
> these buffers when user space needs them. The simple approach I was
> considering was to use the lru list head to store lists of used and
> unused pages -- though there are probably other solutions that don't
> require using struct pages.

The current assumption is that ZONE_DEVICE ranges are managed by a
physical address allocator. In the case of persistent memory this is
the block allocator of the filesystem sitting on top of a pmem block
device. The struct page is really only there to facilitate in-flight
I/O requests; if it weren't for the complexity we'd allocate them on
demand. So your "unused" case should be a raw pfn, and for the
time-limited duration it is in use as a struct page it should hold a
reference against the mapping.
diff --git a/mm/util.c b/mm/util.c
index 68ff8a5..19af683 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -368,6 +368,9 @@ struct address_space *page_mapping(struct page *page)
 		return swap_address_space(entry);
 	}
 
+	if (unlikely(is_zone_device_page(page)))
+		return NULL;
+
 	mapping = (unsigned long)page->mapping;
 	if (mapping & PAGE_MAPPING_FLAGS)
 		return NULL;