Message ID | 155805321833.867447.3864104616303535270.stgit@dwillia2-desk3.amr.corp.intel.com
---|---
State | Superseded
Series | libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
Let's add Kees to CC for usercopy expertise...

On Thu 16-05-19 17:33:38, Dan Williams wrote:
> Jeff discovered that performance improves from ~375K iops to ~519K iops
> on a simple psync-write fio workload when moving the location of 'struct
> page' from the default PMEM location to DRAM. This result is surprising
> because the expectation is that 'struct page' for dax is only needed for
> third party references to dax mappings. For example, a dax-mapped buffer
> passed to another system call for direct-I/O requires 'struct page' for
> sending the request down the driver stack and pinning the page. There is
> no usage of 'struct page' for first party access to a file via
> read(2)/write(2) and friends.
>
> However, this "no page needed" expectation is violated by
> CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
> copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
> check_heap_object() helper routine assumes the buffer is backed by a
> page-allocator DRAM page and applies some checks. Those checks are
> invalid (dax pages are not from the heap) and redundant
> (dax_iomap_actor() has already validated that the I/O is within bounds).

So this last paragraph is not obvious to me, as check_copy_size() does a
lot of various checks in the CONFIG_HARDENED_USERCOPY case. I agree that
some of those checks don't make sense for PMEM pages, but I'd rather handle
that by refining the check_copy_size() and check_object_size() functions to
detect and appropriately handle pmem pages than by generally skipping all
the checks in pmem_copy_from/to_iter(). And yes, every check in such a hot
path is going to cost performance, but that's what the user asked for with
CONFIG_HARDENED_USERCOPY... Kees?

								Honza

> Bypass this overhead and call the 'no check' versions of the
> copy_{to,from}_iter operations directly.
>
> Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...")
> Cc: Jan Kara <jack@suse.cz>
> Cc: <stable@vger.kernel.org>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> Reported-and-tested-by: Jeff Smits <jeff.smits@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/nvdimm/pmem.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 845c5b430cdd..c894f45e5077 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -281,16 +281,21 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
>  	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
>  }
>
> +/*
> + * Use the 'no check' versions of copy_from_iter_flushcache() and
> + * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. Bounds
> + * checking is handled by dax_iomap_actor()
> + */
>  static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>  		void *addr, size_t bytes, struct iov_iter *i)
>  {
> -	return copy_from_iter_flushcache(addr, bytes, i);
> +	return _copy_from_iter_flushcache(addr, bytes, i);
>  }
>
>  static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>  		void *addr, size_t bytes, struct iov_iter *i)
>  {
> -	return copy_to_iter_mcsafe(addr, bytes, i);
> +	return _copy_to_iter_mcsafe(addr, bytes, i);
>  }
>
>  static const struct dax_operations pmem_dax_ops = {
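The relationship the patch relies on can be sketched in userspace C: the checked copy helpers are the underscore ("no check") variants plus a check_copy_size() gate, so calling the underscore form directly simply skips that gate. Everything below is a toy model of the call structure, not the kernel implementation; the bodies and the `hardened_checks_run` counter are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Counts how often the hardened-usercopy gate ran, to show which
 * call path pays the overhead. */
static unsigned long hardened_checks_run;

/* Stand-in for check_copy_size(): pretend every check passes, but
 * record the invocation. In the kernel this is where the struct-page
 * lookup happens. */
static bool check_copy_size(const void *addr, size_t bytes)
{
	(void)addr;
	(void)bytes;
	hardened_checks_run++;
	return true;
}

/* Stand-in for the raw 'no check' backend. */
static size_t _copy_from_iter_flushcache(void *dst, size_t bytes,
					 const void *src)
{
	memcpy(dst, src, bytes);
	return bytes;
}

/* The checked wrapper: gate first, then delegate. This is the form
 * pmem_copy_from_iter() used before the patch. */
static size_t copy_from_iter_flushcache(void *dst, size_t bytes,
					const void *src)
{
	if (!check_copy_size(dst, bytes))
		return 0;
	return _copy_from_iter_flushcache(dst, bytes, src);
}
```

The patch is then just the one-character change from the wrapper to the underscore variant at the pmem call sites.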
From: Jan Kara
> Sent: 17 May 2019 09:48
...
> So this last paragraph is not obvious to me, as check_copy_size() does a
> lot of various checks in the CONFIG_HARDENED_USERCOPY case. I agree that
> some of those checks don't make sense for PMEM pages, but I'd rather handle
> that by refining the check_copy_size() and check_object_size() functions to
> detect and appropriately handle pmem pages than by generally skipping all
> the checks in pmem_copy_from/to_iter(). And yes, every check in such a hot
> path is going to cost performance, but that's what the user asked for with
> CONFIG_HARDENED_USERCOPY... Kees?

Except that all the distros enable it by default, so you get the
performance cost whether you (as a user) want it or not.

I've changed some of our code to use __get_user() to avoid these stupid
overheads.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Fri, May 17, 2019 at 1:47 AM Jan Kara <jack@suse.cz> wrote:
>
> Let's add Kees to CC for usercopy expertise...
>
> On Thu 16-05-19 17:33:38, Dan Williams wrote:
> > Jeff discovered that performance improves from ~375K iops to ~519K iops
> > on a simple psync-write fio workload when moving the location of 'struct
> > page' from the default PMEM location to DRAM. This result is surprising
> > because the expectation is that 'struct page' for dax is only needed for
> > third party references to dax mappings. For example, a dax-mapped buffer
> > passed to another system call for direct-I/O requires 'struct page' for
> > sending the request down the driver stack and pinning the page. There is
> > no usage of 'struct page' for first party access to a file via
> > read(2)/write(2) and friends.
> >
> > However, this "no page needed" expectation is violated by
> > CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
> > copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
> > check_heap_object() helper routine assumes the buffer is backed by a
> > page-allocator DRAM page and applies some checks. Those checks are
> > invalid (dax pages are not from the heap) and redundant
> > (dax_iomap_actor() has already validated that the I/O is within bounds).
>
> So this last paragraph is not obvious to me, as check_copy_size() does a
> lot of various checks in the CONFIG_HARDENED_USERCOPY case. I agree that
> some of those checks don't make sense for PMEM pages, but I'd rather handle
> that by refining the check_copy_size() and check_object_size() functions to
> detect and appropriately handle pmem pages than by generally skipping all
> the checks in pmem_copy_from/to_iter(). And yes, every check in such a hot
> path is going to cost performance, but that's what the user asked for with
> CONFIG_HARDENED_USERCOPY... Kees?

As far as I can see it's mostly check_heap_object() that is the problem, so
I'm open to finding a way to just bypass that sub-routine. However, as far
as I can see none of the other block / filesystem user copy implementations
submit to the hardened checks, like bio_copy_from_iter() and
iov_iter_copy_from_user_atomic(). So either those need to grow additional
checks, or the hardened copy implementation is targeting single-object copy
use cases, not necessarily block-I/O. Yes, Kees, please advise.
On Fri, May 17, 2019 at 09:06:26AM +0000, David Laight wrote:
> From: Jan Kara
> > Sent: 17 May 2019 09:48
> ...
> > So this last paragraph is not obvious to me, as check_copy_size() does a
> > lot of various checks in the CONFIG_HARDENED_USERCOPY case. I agree that
> > some of those checks don't make sense for PMEM pages, but I'd rather
> > handle that by refining the check_copy_size() and check_object_size()
> > functions to detect and appropriately handle pmem pages than by generally
> > skipping all the checks in pmem_copy_from/to_iter(). And yes, every check
> > in such a hot path is going to cost performance, but that's what the user
> > asked for with CONFIG_HARDENED_USERCOPY... Kees?
>
> Except that all the distros enable it by default.
> So you get the performance cost whether you (as a user) want it or not.

Note that it can be disabled on the kernel command line, but that seems
like a last resort. :)

> I've changed some of our code to use __get_user() to avoid
> these stupid overheads.

__get_user() skips even access_ok() checking too, so that doesn't seem like
a good idea. Did you run access_ok() checks separately? (This generally
isn't recommended.)

The usercopy hardening is intended to only kick in when the copy size isn't
a builtin constant: it attempts a bounds check on the size, with the
pointer used to figure out what bounds checking is possible (basically "is
this stack? check stack location/frame size" or "is this a kmem cache?
check the allocation size") and then does the bounds checks from there. It
tries to bail out early to avoid needless checking, so if there is some
additional logic to be added to check_object_size() that is globally
applicable, sure, let's do it.

I'm still not clear from this thread on which case is getting
tripped/slowed. If you're already doing bounds checks somewhere else and
there isn't a chance for the pointer or size to change, then yeah, it seems
safe to skip the usercopy size checks. But what's the actual code that is
having a problem?
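The dispatch Kees describes ("is this stack? ... is this a kmem cache?") can be modeled with a toy classifier. The enum, the fixed frame/allocation sizes, and the function name below are all invented for illustration; the real check_object_size() works on pointers and derives the region from the memory layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative region classes; the kernel derives these from the
 * pointer itself (and, for the slab case, from the struct page --
 * the lookup that is expensive on pmem). */
enum region { REGION_STACK, REGION_SLAB, REGION_OTHER };

/* Toy bounds check: each region class gets its own limit, mirroring
 * "check stack frame size" vs "check allocation size". */
static bool bounds_ok(enum region r, size_t offset, size_t n)
{
	switch (r) {
	case REGION_STACK:
		/* pretend the current stack frame is 256 bytes */
		return offset + n <= 256;
	case REGION_SLAB:
		/* pretend the kmem cache allocation is 128 bytes */
		return offset + n <= 128;
	case REGION_OTHER:
		/* neither stack nor slab (e.g. dax/pmem): nothing the
		 * hardened check can bound, so it passes */
		return true;
	}
	return false;
}
```

The pmem case falls into the last branch: the check can never fail there, which is why the thread concludes that paying for the classification on every copy buys nothing on that path.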
On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote:
> As far as I can see it's mostly check_heap_object() that is the
> problem, so I'm open to finding a way to just bypass that sub-routine.
> However, as far as I can see none of the other block / filesystem user
> copy implementations submit to the hardened checks, like
> bio_copy_from_iter() and iov_iter_copy_from_user_atomic(). So either
> those need to grow additional checks, or the hardened copy
> implementation is targeting single-object copy use cases, not
> necessarily block-I/O. Yes, Kees, please advise.

The intention is mainly for copies that haven't had explicit bounds
checking already performed on them, yes. Is there something getting checked
out of the slab, or is it literally just the overhead of doing the "is this
slab?" check that you're seeing?
From: Kees Cook
> Sent: 17 May 2019 16:54
...
> > I've changed some of our code to use __get_user() to avoid
> > these stupid overheads.
>
> __get_user() skips even access_ok() checking too, so that doesn't seem
> like a good idea. Did you run access_ok() checks separately? (This
> generally isn't recommended.)

Of course, I'm not THAT stupid :-)

	David
On Fri, May 17, 2019 at 04:14:03PM +0000, David Laight wrote:
> From: Kees Cook
> > Sent: 17 May 2019 16:54
> ...
> > > I've changed some of our code to use __get_user() to avoid
> > > these stupid overheads.
> >
> > __get_user() skips even access_ok() checking too, so that doesn't seem
> > like a good idea. Did you run access_ok() checks separately? (This
> > generally isn't recommended.)
>
> Of course, I'm not THAT stupid :-)

Right, yes, I know. :) I just wanted to double-check since accidents can
happen. The number of underscores on these functions is not really a great
way to indicate what they're doing. ;)
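The underscore convention under discussion — get_user() performs access_ok() and then the raw fetch that __get_user() does alone — can be sketched as a userspace model. The limit constant, the `_model` suffixes, and the fetched value are stand-ins; the real macros are architecture-specific and operate on actual user pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Pretend top of the user address range for the sketch. */
#define USER_SPACE_LIMIT 0x7fffffffffffull

/* Model of access_ok(): is [uaddr, uaddr+len) within user space? */
static bool access_ok_model(uint64_t uaddr, size_t len)
{
	return uaddr + len <= USER_SPACE_LIMIT;
}

/* Model of __get_user(): raw fetch with no range check -- the caller
 * is responsible for having called access_ok() already. */
static int __get_user_model(int *dst, uint64_t uaddr)
{
	(void)uaddr;
	*dst = 42;		/* pretend the fetch succeeded */
	return 0;
}

/* Model of get_user(): range check first, then the raw fetch. */
static int get_user_model(int *dst, uint64_t uaddr)
{
	if (!access_ok_model(uaddr, sizeof(int)))
		return -14;	/* -EFAULT */
	return __get_user_model(dst, uaddr);
}
```

This is the split David is exploiting (check once, then use the raw form in a loop) and the ambiguity Kees is poking fun at: nothing in the name `__get_user` says which checks were dropped.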
On Fri, May 17, 2019 at 8:57 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote:
> > As far as I can see it's mostly check_heap_object() that is the
> > problem, so I'm open to finding a way to just bypass that sub-routine.
> > However, as far as I can see none of the other block / filesystem user
> > copy implementations submit to the hardened checks, like
> > bio_copy_from_iter() and iov_iter_copy_from_user_atomic(). So either
> > those need to grow additional checks, or the hardened copy
> > implementation is targeting single-object copy use cases, not
> > necessarily block-I/O. Yes, Kees, please advise.
>
> The intention is mainly for copies that haven't had explicit bounds
> checking already performed on them, yes. Is there something getting
> checked out of the slab, or is it literally just the overhead of doing
> the "is this slab?" check that you're seeing?

It's literally the overhead of the "is this slab?" check, since it needs to
go retrieve the struct page and read that potentially cold cacheline. In
the case where that page is on memory media with higher latency than DRAM
we get the ~37% performance loss that Jeff measured.

The path is via the filesystem ->write_iter() file operation. In the DAX
case the filesystem traps that path early, before submitting block I/O, and
routes it to the dax_iomap_actor() routine. That routine validates that the
logical file offset is within bounds of the file, then it does a
sector-to-pfn translation which validates that the physical mapping is
within bounds of the block device.

It seems dax_iomap_actor() is not a path where we'd be worried about
needing hardened user copy checks.
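The two validations Dan describes can be sketched as a toy model: one check that the logical file range is within the file, and one that the translated sector range is within the block device. The fixed sizes and helper names below are invented for the sketch; dax_iomap_actor() performs these checks against the real inode and device geometry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pretend geometry for the sketch. */
#define FILE_SIZE   (1u << 20)	/* 1 MiB file */
#define DEV_SECTORS 4096u	/* device size in 512-byte sectors */

/* Check 1: logical file offset within bounds of the file. */
static bool offset_within_file(uint64_t pos, uint64_t len)
{
	return pos + len <= FILE_SIZE;
}

/* Check 2: the sector-to-pfn translation is within bounds of the
 * block device. */
static bool sector_within_device(uint64_t sector, uint64_t nr_sectors)
{
	return sector + nr_sectors <= DEV_SECTORS;
}

/* Both checks must pass before the raw copy is issued, which is what
 * makes re-checking bounds in the usercopy layer redundant here. */
static bool dax_io_ok(uint64_t pos, uint64_t len, uint64_t sector)
{
	return offset_within_file(pos, len) &&
	       sector_within_device(sector, (len + 511) / 512);
}
```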
On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote:
> On Fri, May 17, 2019 at 8:57 AM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote:
> > > As far as I can see it's mostly check_heap_object() that is the
> > > problem, so I'm open to finding a way to just bypass that sub-routine.
> > > However, as far as I can see none of the other block / filesystem user
> > > copy implementations submit to the hardened checks, like
> > > bio_copy_from_iter() and iov_iter_copy_from_user_atomic(). So either
> > > those need to grow additional checks, or the hardened copy
> > > implementation is targeting single-object copy use cases, not
> > > necessarily block-I/O. Yes, Kees, please advise.
> >
> > The intention is mainly for copies that haven't had explicit bounds
> > checking already performed on them, yes. Is there something getting
> > checked out of the slab, or is it literally just the overhead of doing
> > the "is this slab?" check that you're seeing?
>
> It's literally the overhead of the "is this slab?" check, since it needs
> to go retrieve the struct page and read that potentially cold cacheline.
> In the case where that page is on memory media with higher latency than
> DRAM we get the ~37% performance loss that Jeff measured.

Ah-ha! Okay, I understand now; thanks!

> The path is via the filesystem ->write_iter() file operation. In the
> DAX case the filesystem traps that path early, before submitting block
> I/O, and routes it to the dax_iomap_actor() routine. That routine
> validates that the logical file offset is within bounds of the file,
> then it does a sector-to-pfn translation which validates that the
> physical mapping is within bounds of the block device.
>
> It seems dax_iomap_actor() is not a path where we'd be worried about
> needing hardened user copy checks.

I would agree: I think the proposed patch makes sense. :)
On Fri, May 17, 2019 at 12:25 PM Kees Cook <keescook@chromium.org> wrote:
> On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote:
> > It seems dax_iomap_actor() is not a path where we'd be worried about
> > needing hardened user copy checks.
>
> I would agree: I think the proposed patch makes sense. :)

Sounds like an acked-by to me.
On Sat 18-05-19 21:46:03, Dan Williams wrote:
> On Fri, May 17, 2019 at 12:25 PM Kees Cook <keescook@chromium.org> wrote:
> > On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote:
> > > It seems dax_iomap_actor() is not a path where we'd be worried about
> > > needing hardened user copy checks.
> >
> > I would agree: I think the proposed patch makes sense. :)
>
> Sounds like an acked-by to me.

Yeah, if Kees agrees, I'm fine with skipping the checks as well. I just
wanted that to be clarified. Also it helped me that you wrote:

  That routine (dax_iomap_actor()) validates that the logical file offset
  is within bounds of the file, then it does a sector-to-pfn translation
  which validates that the physical mapping is within bounds of the block
  device.

That is more specific than "dax_iomap_actor() takes care of necessary
checks", which was in the changelog, and the above paragraph helped me
clarify which checks in dax_iomap_actor() you think replace those usercopy
checks. So I think it would be good to add that paragraph as a comment to
those pmem_copy_from/to_iter() functions, just in case we are wondering in
the future why we are skipping the checks... Also feel free to add:

Acked-by: Jan Kara <jack@suse.cz>

								Honza
On Mon, May 20, 2019 at 12:52 AM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 18-05-19 21:46:03, Dan Williams wrote:
> > On Fri, May 17, 2019 at 12:25 PM Kees Cook <keescook@chromium.org> wrote:
> > > On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote:
> > > > It seems dax_iomap_actor() is not a path where we'd be worried about
> > > > needing hardened user copy checks.
> > >
> > > I would agree: I think the proposed patch makes sense. :)
> >
> > Sounds like an acked-by to me.
>
> Yeah, if Kees agrees, I'm fine with skipping the checks as well. I just
> wanted that to be clarified. Also it helped me that you wrote:
>
>   That routine (dax_iomap_actor()) validates that the logical file offset
>   is within bounds of the file, then it does a sector-to-pfn translation
>   which validates that the physical mapping is within bounds of the block
>   device.
>
> That is more specific than "dax_iomap_actor() takes care of necessary
> checks", which was in the changelog, and the above paragraph helped me
> clarify which checks in dax_iomap_actor() you think replace those usercopy
> checks. So I think it would be good to add that paragraph as a comment to
> those pmem_copy_from/to_iter() functions, just in case we are wondering in
> the future why we are skipping the checks... Also feel free to add:
>
> Acked-by: Jan Kara <jack@suse.cz>

Will do, thanks Jan.
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 845c5b430cdd..c894f45e5077 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -281,16 +281,21 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+/*
+ * Use the 'no check' versions of copy_from_iter_flushcache() and
+ * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. Bounds
+ * checking is handled by dax_iomap_actor()
+ */
 static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 		void *addr, size_t bytes, struct iov_iter *i)
 {
-	return copy_from_iter_flushcache(addr, bytes, i);
+	return _copy_from_iter_flushcache(addr, bytes, i);
 }
 
 static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 		void *addr, size_t bytes, struct iov_iter *i)
 {
-	return copy_to_iter_mcsafe(addr, bytes, i);
+	return _copy_to_iter_mcsafe(addr, bytes, i);
 }
 
 static const struct dax_operations pmem_dax_ops = {
Jeff discovered that performance improves from ~375K iops to ~519K iops on
a simple psync-write fio workload when moving the location of 'struct page'
from the default PMEM location to DRAM. This result is surprising because
the expectation is that 'struct page' for dax is only needed for third
party references to dax mappings. For example, a dax-mapped buffer passed
to another system call for direct-I/O requires 'struct page' for sending
the request down the driver stack and pinning the page. There is no usage
of 'struct page' for first party access to a file via read(2)/write(2) and
friends.

However, this "no page needed" expectation is violated by
CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
check_heap_object() helper routine assumes the buffer is backed by a
page-allocator DRAM page and applies some checks. Those checks are invalid
(dax pages are not from the heap) and redundant (dax_iomap_actor() has
already validated that the I/O is within bounds).

Bypass this overhead and call the 'no check' versions of the
copy_{to,from}_iter operations directly.

Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...")
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <willy@infradead.org>
Reported-and-tested-by: Jeff Smits <jeff.smits@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)