Message ID | 20200221182503.28317-1-logang@deltatee.com (mailing list archive) |
---|---|
Series | Allow setting caching mode in arch_add_memory() for P2PDMA |
On Fri, Feb 21, 2020 at 11:24:56AM -0700, Logan Gunthorpe wrote:
> Hi,
>
> This is v3 of the patchset which cleans up a number of minor issues
> from the feedback of v2 and rebases onto v5.6-rc2. Additional feedback
> is welcome.
>
> Thanks,
>
> Logan
>
> --
>
> Changes in v3:
> * Rebased onto v5.6-rc2
> * Rename mhp_modifiers to mhp_params per David with an updated kernel
>   doc per Dan
> * Drop support for s390 per David seeing it does not support
>   ZONE_DEVICE yet and there was a potential problem with huge pages.
> * Added WARN_ON_ONCE in cases where arches receive non PAGE_KERNEL
>   parameters
> * Collected David and Micheal's Reviewed-By and Acked-by Tags
>
> Changes in v2:
> * Rebased onto v5.5-rc5
> * Renamed mhp_restrictions to mhp_modifiers and added the pgprot field
>   to that structure instead of using an argument for
>   arch_add_memory().
> * Add patch to drop the unused flags field in mhp_restrictions
>
> A git branch is available here:
>
> https://github.com/sbates130272/linux-p2pmem remap_pages_cache_v3
>
> --
>
> Currently, the page tables created using memremap_pages() are always
> created with the PAGE_KERNEL caching mode. However, the P2PDMA code
> is creating pages for PCI BAR memory which should never be accessed
> through the cache and instead use either WC or UC. This still works in
> most cases, on x86, because the MTRR registers typically override the
> caching settings in the page tables for all of the IO memory to be
> UC-. However, this tends not to work so well on other arches or
> some rare x86 machines that have firmware which does not set up the
> MTRR registers in this way.
>
> Instead of this, this series proposes a change to arch_add_memory()
> to take the pgprot required by the mapping which allows us to
> explicitly set pagetable entries for P2PDMA memory to WC.

Is there a particular reason why WC was selected here? I thought for
the p2pdma cases there was no kernel user that touched the memory?

I definitely foresee devices where we want UC instead.

Even so, the whole idea looks like the right direction to me.

Jason

On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
>> Instead of this, this series proposes a change to arch_add_memory()
>> to take the pgprot required by the mapping which allows us to
>> explicitly set pagetable entries for P2PDMA memory to WC.
>
> Is there a particular reason why WC was selected here? I thought for
> the p2pdma cases there was no kernel user that touched the memory?

Yes, that's correct. I chose WC here because the existing users are
registering memory blocks without side effects which fit the WC
semantics well.

> I definitely foresee devices where we want UC instead.

Yes. My expectation is that once we have a kernel user that needs this,
we'd wire the option through struct dev_pagemap so the caller can
choose the mapping that makes sense.

Logan

On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
>
> On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> >> Instead of this, this series proposes a change to arch_add_memory()
> >> to take the pgprot required by the mapping which allows us to
> >> explicitly set pagetable entries for P2PDMA memory to WC.
> >
> > Is there a particular reason why WC was selected here? I thought for
> > the p2pdma cases there was no kernel user that touched the memory?
>
> Yes, that's correct. I chose WC here because the existing users are
> registering memory blocks without side effects which fit the WC
> semantics well.

Hm, AFAIK WC memory is not compatible with the spinlocks/mutexes/etc in
Linux, so while it is true the memory has no side effects, there would
be surprising concurrency risks if anything in the kernel tried to
write to it.

Not compatible means the locks don't contain stores to WC memory the
way you would expect. AFAIK on many CPUs extra barriers are required
to keep WC stores ordered, the same way ARM already has extra barriers
to keep UC stores ordered with locking.

The spinlocks are defined to contain UC stores though.

If there is no actual need today for WC I would suggest using UC as
the default.

Jason

On 2020-02-27 10:43 a.m., Jason Gunthorpe wrote:
> Hm, AFAIK WC memory is not compatible with the spinlocks/mutexes/etc in
> Linux, so while it is true the memory has no side effects, there would
> be surprising concurrency risks if anything in the kernel tried to
> write to it.
>
> Not compatible means the locks don't contain stores to WC memory the
> way you would expect. AFAIK on many CPUs extra barriers are required
> to keep WC stores ordered, the same way ARM already has extra barriers
> to keep UC stores ordered with locking.
>
> The spinlocks are defined to contain UC stores though.
>
> If there is no actual need today for WC I would suggest using UC as
> the default.

Ok, that sounds sensible. I'll do that in the next revision.

Thanks,

Logan

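The outcome of this exchange (UC as the default for P2PDMA, with a caller-selected mode possibly wired through struct dev_pagemap later) can be pictured with a small user-space sketch. Everything here is a hypothetical model: `cache_mode`, `mhp_params_model`, `pagemap_model` and `pick_params()` are illustrative names, not kernel definitions, and the real series carries a pgprot value rather than an enum.

```c
#include <assert.h>

/* Hypothetical stand-in for the caching bits a pgprot would carry. */
enum cache_mode {
    CACHE_MODE_WB, /* write-back, analogous to the PAGE_KERNEL default */
    CACHE_MODE_WC, /* write-combining */
    CACHE_MODE_UC, /* uncached */
};

/* Simplified model of the mhp_params idea: mapping parameters in one struct. */
struct mhp_params_model {
    enum cache_mode cache;
};

enum pagemap_kind { PAGEMAP_GENERIC, PAGEMAP_P2PDMA };

/*
 * Hypothetical pagemap descriptor. 'has_override' models the per-caller
 * option that might be added once a kernel user needs it.
 */
struct pagemap_model {
    enum pagemap_kind kind;
    int has_override;
    enum cache_mode override;
};

/* Choose the caching mode handed down to the (modelled) arch_add_memory(). */
static struct mhp_params_model pick_params(const struct pagemap_model *pgmap)
{
    struct mhp_params_model params = { .cache = CACHE_MODE_WB };

    if (pgmap->has_override)
        params.cache = pgmap->override;   /* caller knows what it needs */
    else if (pgmap->kind == PAGEMAP_P2PDMA)
        params.cache = CACHE_MODE_UC;     /* conservative default for BAR memory */

    return params;
}
```

In this model a P2PDMA pagemap with no override maps UC, a generic pagemap keeps the write-back default, and a caller that sets the override gets exactly the mode it asked for.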
On Thu, Feb 27, 2020 at 9:43 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
> >
> > On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> > >> Instead of this, this series proposes a change to arch_add_memory()
> > >> to take the pgprot required by the mapping which allows us to
> > >> explicitly set pagetable entries for P2PDMA memory to WC.
> > >
> > > Is there a particular reason why WC was selected here? I thought for
> > > the p2pdma cases there was no kernel user that touched the memory?
> >
> > Yes, that's correct. I chose WC here because the existing users are
> > registering memory blocks without side effects which fit the WC
> > semantics well.
>
> Hm, AFAIK WC memory is not compatible with the spinlocks/mutexes/etc in
> Linux, so while it is true the memory has no side effects, there would
> be surprising concurrency risks if anything in the kernel tried to
> write to it.
>
> Not compatible means the locks don't contain stores to WC memory the
> way you would expect. AFAIK on many CPUs extra barriers are required
> to keep WC stores ordered, the same way ARM already has extra barriers
> to keep UC stores ordered with locking.
>
> The spinlocks are defined to contain UC stores though.

How are spinlocks and mutexes getting into p2pdma ranges in the first
instance? Even with UC, the system has bigger problems if it's trying
to send bus locks targeting PCI, see the flurry of activity of trying
to trigger faults on split locks [1].

This does raise a question about separating the cacheability of the
'struct page' memmap from the BAR range. You get this for free if the
memmap is dynamically allocated from "System RAM", but perhaps
memremap_pages() should explicitly prevent altmap configurations that
try to place the map in PCI space?

> If there is no actual need today for WC I would suggest using UC as
> the default.

That's reasonable, but it still seems to be making a broken
configuration marginally less broken. I'd be more interested in
safeguards that prevent p2pdma mappings from being used for any cpu
atomic cycles.

[1]: https://lwn.net/Articles/784864/

On Thu, Feb 27, 2020 at 09:55:04AM -0800, Dan Williams wrote:
> On Thu, Feb 27, 2020 at 9:43 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
> > >
> > > On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> > > >> Instead of this, this series proposes a change to arch_add_memory()
> > > >> to take the pgprot required by the mapping which allows us to
> > > >> explicitly set pagetable entries for P2PDMA memory to WC.
> > > >
> > > > Is there a particular reason why WC was selected here? I thought for
> > > > the p2pdma cases there was no kernel user that touched the memory?
> > >
> > > Yes, that's correct. I chose WC here because the existing users are
> > > registering memory blocks without side effects which fit the WC
> > > semantics well.
> >
> > Hm, AFAIK WC memory is not compatible with the spinlocks/mutexes/etc in
> > Linux, so while it is true the memory has no side effects, there would
> > be surprising concurrency risks if anything in the kernel tried to
> > write to it.
> >
> > Not compatible means the locks don't contain stores to WC memory the
> > way you would expect. AFAIK on many CPUs extra barriers are required
> > to keep WC stores ordered, the same way ARM already has extra barriers
> > to keep UC stores ordered with locking.
> >
> > The spinlocks are defined to contain UC stores though.
>
> How are spinlocks and mutexes getting into p2pdma ranges in the first
> instance? Even with UC, the system has bigger problems if it's trying
> to send bus locks targeting PCI, see the flurry of activity of trying
> to trigger faults on split locks [1].

This is not what I was trying to explain.

Consider

    static spinlock lock; // CPU DRAM
    static idx = 0;
    u64 *wc_memory = [..];

    spin_lock(&lock);
    wc_memory[0] = idx++;
    spin_unlock(&lock);

You'd expect that the PCI device will observe stores where idx is
strictly increasing, but this is not guaranteed. idx may decrease, idx
may skip. It just won't duplicate.

Or perhaps

    wc_memory[0] = foo;
    writel(doorbell)

foo is not guaranteed observable by the device before doorbell reaches
the device.

All of these are things that do not happen with UC or NC memory, and
are surprising violations of our programming model.

Generic kernel code should never touch WC memory unless the code is
specifically designed to handle it.

Jason

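The "idx may decrease, idx may skip" behaviour described above can be made concrete with a toy user-space model. This is only a sketch under loud assumptions: each CPU is given one hypothetical write-combining buffer for a single address, a buffer drains at an arbitrary later time, and `wc_flush()` stands in for whatever eviction or barrier a real architecture uses. None of the names below are kernel APIs, and the model does not describe any particular CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OBSERVED 8

/* Values the modelled PCI device has seen, in arrival order. */
struct device_model {
    uint64_t observed[MAX_OBSERVED];
    size_t count;
};

/* One CPU's write-combining buffer for a single address (toy model). */
struct wc_buffer {
    int pending;
    uint64_t value;
};

/* A WC store lands in the CPU-local buffer; a later store overwrites it. */
static void wc_store(struct wc_buffer *buf, uint64_t value)
{
    buf->pending = 1;
    buf->value = value;
}

/* Drain the buffer to the device: models an eviction or an explicit barrier. */
static void wc_flush(struct wc_buffer *buf, struct device_model *dev)
{
    if (!buf->pending)
        return;
    if (dev->count < MAX_OBSERVED)
        dev->observed[dev->count++] = buf->value;
    buf->pending = 0;
}

/* Two CPUs take the lock in turn, but their buffers drain out of order. */
static void run_decreasing(struct device_model *dev)
{
    struct wc_buffer cpu_a = {0}, cpu_b = {0};
    uint64_t idx = 0;

    wc_store(&cpu_a, idx++); /* CPU A: lock, store 0, unlock */
    wc_store(&cpu_b, idx++); /* CPU B: lock, store 1, unlock */
    wc_flush(&cpu_b, dev);   /* B's buffer happens to drain first */
    wc_flush(&cpu_a, dev);   /* A's older store arrives afterwards */
}

/* One CPU stores twice before its buffer drains: the first value is lost. */
static void run_skipping(struct device_model *dev)
{
    struct wc_buffer cpu_a = {0};
    uint64_t idx = 0;

    wc_store(&cpu_a, idx++); /* store 0 */
    wc_store(&cpu_a, idx++); /* store 1 combines over store 0 */
    wc_flush(&cpu_a, dev);
}

/* Draining inside the critical section restores the expected order. */
static void run_flushed(struct device_model *dev)
{
    struct wc_buffer cpu_a = {0}, cpu_b = {0};
    uint64_t idx = 0;

    wc_store(&cpu_a, idx++);
    wc_flush(&cpu_a, dev);   /* barrier before CPU A unlocks */
    wc_store(&cpu_b, idx++);
    wc_flush(&cpu_b, dev);   /* barrier before CPU B unlocks */
}
```

In the model, `run_decreasing()` makes the device observe 1 and then 0, `run_skipping()` makes it observe only 1, and `run_flushed()` yields 0 and then 1. The locking itself is correct in all three cases; what differs is when the buffered stores reach the device, which is the point being made about code that was not designed for WC memory.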
On Thu, Feb 27, 2020 at 10:03 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Feb 27, 2020 at 09:55:04AM -0800, Dan Williams wrote:
> > On Thu, Feb 27, 2020 at 9:43 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
> > > >
> > > > On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> > > > >> Instead of this, this series proposes a change to arch_add_memory()
> > > > >> to take the pgprot required by the mapping which allows us to
> > > > >> explicitly set pagetable entries for P2PDMA memory to WC.
> > > > >
> > > > > Is there a particular reason why WC was selected here? I thought for
> > > > > the p2pdma cases there was no kernel user that touched the memory?
> > > >
> > > > Yes, that's correct. I chose WC here because the existing users are
> > > > registering memory blocks without side effects which fit the WC
> > > > semantics well.
> > >
> > > Hm, AFAIK WC memory is not compatible with the spinlocks/mutexes/etc in
> > > Linux, so while it is true the memory has no side effects, there would
> > > be surprising concurrency risks if anything in the kernel tried to
> > > write to it.
> > >
> > > Not compatible means the locks don't contain stores to WC memory the
> > > way you would expect. AFAIK on many CPUs extra barriers are required
> > > to keep WC stores ordered, the same way ARM already has extra barriers
> > > to keep UC stores ordered with locking.
> > >
> > > The spinlocks are defined to contain UC stores though.
> >
> > How are spinlocks and mutexes getting into p2pdma ranges in the first
> > instance? Even with UC, the system has bigger problems if it's trying
> > to send bus locks targeting PCI, see the flurry of activity of trying
> > to trigger faults on split locks [1].
>
> This is not what I was trying to explain.
>
> Consider
>
>     static spinlock lock; // CPU DRAM
>     static idx = 0;
>     u64 *wc_memory = [..];
>
>     spin_lock(&lock);
>     wc_memory[0] = idx++;
>     spin_unlock(&lock);
>
> You'd expect that the PCI device will observe stores where idx is
> strictly increasing, but this is not guaranteed. idx may decrease, idx
> may skip. It just won't duplicate.
>
> Or perhaps
>
>     wc_memory[0] = foo;
>     writel(doorbell)
>
> foo is not guaranteed observable by the device before doorbell reaches
> the device.
>
> All of these are things that do not happen with UC or NC memory, and
> are surprising violations of our programming model.
>
> Generic kernel code should never touch WC memory unless the code is
> specifically designed to handle it.

Ah, yes, agree.