Message ID | 1432739944-22633-13-git-send-email-toshi.kani@hp.com (mailing list archive)
---|---
State | Superseded
On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
> The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
> write back the contents of the CPU caches in case of a crash.
>
> This patch changes to use ioremap_wt(), which provides uncached
> writes but cached reads, for improving read performance.
>
> Signed-off-by: Toshi Kani <toshi.kani@hp.com>
> ---
>  drivers/block/pmem.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
> index eabf4a8..095dfaa 100644
> --- a/drivers/block/pmem.c
> +++ b/drivers/block/pmem.c
> @@ -139,11 +139,11 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
>  	}
>
>  	/*
> -	 * Map the memory as non-cachable, as we can't write back the contents
> +	 * Map the memory as write-through, as we can't write back the contents
>  	 * of the CPU caches in case of a crash.
>  	 */
>  	err = -ENOMEM;
> -	pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
> +	pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
>  	if (!pmem->virt_addr)
>  		goto out_release_region;

Dan, Ross, what about this one?

ACK to pick it up as a temporary solution?
On Fri, May 29, 2015 at 2:11 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
>> The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
>> write back the contents of the CPU caches in case of a crash.
>>
>> This patch changes to use ioremap_wt(), which provides uncached
>> writes but cached reads, for improving read performance.
[...]
>
> Dan, Ross, what about this one?
>
> ACK to pick it up as a temporary solution?

I see that is_new_memtype_allowed() is updated to disallow some
combinations, but the manual seems to imply any mixing of memory types
is unsupported. Which worries me even in the current code, where we
have uncached mappings in the driver and potentially cached DAX
mappings handed out to userspace.

A general quibble separate from this patch is that we don't have a way
of knowing if ioremap() will reject or change our requested memory
type. Shouldn't the driver be explicitly requesting a known-valid
type in advance?

Lastly, we now have the PMEM API patches from Ross out for review,
where he is assuming cached mappings with non-temporal writes:
https://lists.01.org/pipermail/linux-nvdimm/2015-May/000929.html.
This gives us WC semantics on writes, which I believe has the nice
property of reducing the number of write transactions to memory.
Also, the numbers in the paper seem to assume DAX operation, but this
ioremap_wt() is in the driver and typically behind a file system.
Are the numbers relevant to that usage mode?
On Fri, 2015-05-29 at 07:43 -0700, Dan Williams wrote:
> On Fri, May 29, 2015 at 2:11 AM, Borislav Petkov <bp@alien8.de> wrote:
> > On Wed, May 27, 2015 at 09:19:04AM -0600, Toshi Kani wrote:
> > > The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
> > > write back the contents of the CPU caches in case of a crash.
[...]
> > Dan, Ross, what about this one?
> >
> > ACK to pick it up as a temporary solution?
>
> I see that is_new_memtype_allowed() is updated to disallow some
> combinations, but the manual seems to imply any mixing of memory types
> is unsupported. Which worries me even in the current code, where we
> have uncached mappings in the driver and potentially cached DAX
> mappings handed out to userspace.

is_new_memtype_allowed() does not allow mixing of memory types. When a
request is allowed, the requested type of ioremap_xxx() is changed to
match the existing map type, so that mixing of memory types does not
happen.

DAX uses vm_insert_mixed(), which does not even check the existing map
type of the physical address.

> A general quibble separate from this patch is that we don't have a way
> of knowing if ioremap() will reject or change our requested memory
> type. Shouldn't the driver be explicitly requesting a known-valid
> type in advance?

I agree we need a solution here.

> Lastly, we now have the PMEM API patches from Ross out for review,
> where he is assuming cached mappings with non-temporal writes:
> https://lists.01.org/pipermail/linux-nvdimm/2015-May/000929.html.
[...]

I have not looked into Ross's changes yet, but they do not seem to
replace the use of ioremap_nocache(). If his changes can use the WB
type reliably, then yes, we do not need a temporary solution of using
ioremap_wt() in this driver.

Thanks,
-Toshi
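As an aside, a minimal sketch of the downgrade behavior Toshi describes. The enum and helper below are a hypothetical model, not the kernel's actual is_new_memtype_allowed() code: a conflicting request is either failed or silently adjusted to the established type, so two linear mappings never carry different memory types.

```c
/*
 * Hypothetical model of PAT conflict resolution -- not the kernel's
 * is_new_memtype_allowed() itself. A new request that conflicts with
 * an existing mapping is either rejected or downgraded to the
 * established type.
 */
enum mem_type { MT_WB, MT_WT, MT_WC, MT_UC };

static int resolve_memtype(enum mem_type existing, enum mem_type *req)
{
	if (*req == existing)
		return 0;		/* no conflict */

	/*
	 * A cacheable (WB) request may be downgraded to the existing
	 * uncached/write-combining/write-through type.
	 */
	if (*req == MT_WB) {
		*req = existing;	/* caller silently gets this type */
		return 0;
	}

	return -1;			/* incompatible: fail the mapping */
}
```

Note how the caller's requested type is modified in place; this is exactly the "silent degrade" behavior Dan objects to below.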
On Fri, May 29, 2015 at 8:03 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> On Fri, 2015-05-29 at 07:43 -0700, Dan Williams wrote:
>> On Fri, May 29, 2015 at 2:11 AM, Borislav Petkov <bp@alien8.de> wrote:
>> > Dan, Ross, what about this one?
>> >
>> > ACK to pick it up as a temporary solution?
>>
>> I see that is_new_memtype_allowed() is updated to disallow some
>> combinations, but the manual seems to imply any mixing of memory types
>> is unsupported. Which worries me even in the current code, where we
>> have uncached mappings in the driver and potentially cached DAX
>> mappings handed out to userspace.
>
> is_new_memtype_allowed() does not allow mixing of memory types. When a
> request is allowed, the requested type of ioremap_xxx() is changed to
> match the existing map type, so that mixing of memory types does not
> happen.

Yes, but if the caller was expecting one memory type and gets another,
that is something I think the driver would want to know. At a minimum,
I don't think we want to get emails about pmem driver performance
problems when someone's platform is silently degrading WB to UC, for
example.

> DAX uses vm_insert_mixed(), which does not even check the existing map
> type of the physical address.

Right, I think that's a problem...

>> A general quibble separate from this patch is that we don't have a way
>> of knowing if ioremap() will reject or change our requested memory
>> type. Shouldn't the driver be explicitly requesting a known-valid
>> type in advance?
>
> I agree we need a solution here.
>
>> Lastly, we now have the PMEM API patches from Ross out for review,
>> where he is assuming cached mappings with non-temporal writes:
>> https://lists.01.org/pipermail/linux-nvdimm/2015-May/000929.html.
[...]
> I have not looked into Ross's changes yet, but they do not seem to
> replace the use of ioremap_nocache(). If his changes can use the WB
> type reliably, then yes, we do not need a temporary solution of using
> ioremap_wt() in this driver.

Hmm, yes, you're right, it seems those patches did not change the
implementation to use ioremap_cache()... which happens to not be
implemented on all architectures. I'll take a look.
On Fri, 2015-05-29 at 11:19 -0700, Dan Williams wrote:
> On Fri, May 29, 2015 at 8:03 AM, Toshi Kani <toshi.kani@hp.com> wrote:
[...]
> Yes, but if the caller was expecting one memory type and gets another,
> that is something I think the driver would want to know. At a minimum,
> I don't think we want to get emails about pmem driver performance
> problems when someone's platform is silently degrading WB to UC, for
> example.

The pmem driver creates an ioremap mapping to an NVDIMM range first.
So there will be no conflict at that point, unless there is a
conflicting driver claiming the same NVDIMM range.

DAX then uses the pmem driver (or another byte-addressable driver) to
mount a file system and creates a separate user-space mapping for
mmap(). So a (silent) map-type conflict can happen at this point,
which may not be protected against by the ioremap itself.

> > DAX uses vm_insert_mixed(), which does not even check the existing map
> > type of the physical address.
>
> Right, I think that's a problem...
[...]
> Hmm, yes, you're right, it seems those patches did not change the
> implementation to use ioremap_cache()... which happens to not be
> implemented on all architectures. I'll take a look.

Thanks,
-Toshi
On Fri, May 29, 2015 at 11:19 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, May 29, 2015 at 8:03 AM, Toshi Kani <toshi.kani@hp.com> wrote:
[...]
>> I have not looked into Ross's changes yet, but they do not seem to
>> replace the use of ioremap_nocache(). If his changes can use the WB
>> type reliably, then yes, we do not need a temporary solution of using
>> ioremap_wt() in this driver.
>
> Hmm, yes, you're right, it seems those patches did not change the
> implementation to use ioremap_cache()... which happens to not be
> implemented on all architectures. I'll take a look.

Whoa, there! Why would we use non-temporal stores to WB memory to
access persistent memory? I can see two reasons not to:

1. As far as I understand it, non-temporal stores to WT should have
almost identical performance.

2. Is there any actual architectural guarantee that it's safe to have
a WB mapping that we use like that? By my reading of the manual,
MOVNTDQ (as a write to pmem); SFENCE; PCOMMIT; SFENCE on uncached
memory should be guaranteed to do a durable write. On the other hand,
it's considerably less clear to me that the same sequence to WB memory
is safe -- aren't we supposed to stick a CLWB or CLFLUSHOPT in there,
too, on WB memory? In other words, is there any case in which MOVNTDQ
or similar acting on a WB mapping could result in a dirty cache line?

--Andy
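For concreteness, a minimal sketch of the durable-write sequence Andy describes, assuming x86-64 with PCOMMIT support. It uses MOVNTI (a simpler non-temporal store than MOVNTDQ) and emits PCOMMIT as raw opcode bytes, since assemblers of the day may not know the mnemonic; this is illustrative only, not kernel code.

```c
#include <stdint.h>

/*
 * Sketch of a durable non-temporal store on x86-64 with PCOMMIT:
 * NT store; SFENCE; PCOMMIT; SFENCE.
 */
static inline void durable_store_u64(volatile uint64_t *dst, uint64_t val)
{
	/* MOVNTI: non-temporal store that bypasses the cache hierarchy */
	asm volatile("movnti %1, %0" : "=m" (*dst) : "r" (val));

	/* Order the non-temporal store before the commit */
	asm volatile("sfence" ::: "memory");

	/* PCOMMIT (0x66 0x0F 0xAE 0xF8): push accepted stores from the
	 * memory controller's queues to the persistence domain */
	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf8" ::: "memory");

	/* Order later stores after the commit completes */
	asm volatile("sfence" ::: "memory");
}
```

Andy's question is whether this sequence is sufficient on a WB mapping, or whether a CLWB/CLFLUSHOPT is also needed before the commit.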
On Fri, May 29, 2015 at 11:34 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, May 29, 2015 at 11:19 AM, Dan Williams <dan.j.williams@intel.com> wrote:
[...]
> Whoa, there! Why would we use non-temporal stores to WB memory to
> access persistent memory? I can see two reasons not to:
>
> 1. As far as I understand it, non-temporal stores to WT should have
> almost identical performance.
>
> 2. Is there any actual architectural guarantee that it's safe to have
> a WB mapping that we use like that? By my reading of the manual,
> MOVNTDQ (as a write to pmem); SFENCE; PCOMMIT; SFENCE on uncached
> memory should be guaranteed to do a durable write. On the other hand,
> it's considerably less clear to me that the same sequence to WB memory
> is safe -- aren't we supposed to stick a CLWB or CLFLUSHOPT in there,
> too, on WB memory? In other words, is there any case in which MOVNTDQ
> or similar acting on a WB mapping could result in a dirty cache line?

Depends; see the note in SDM 10.4.6.2: "Some older CPU implementations
(e.g., Pentium M) allowed addresses being written with a non-temporal
store instruction to be updated in-place if the memory type was not WC
and line was already in the cache."

The expectation is that boot_cpu_has(X86_FEATURE_PCOMMIT) is false on
such a cpu, so we'll fall back to not using non-temporal stores.
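As a side note on that gate: boot_cpu_has(X86_FEATURE_PCOMMIT) is the kernel interface. A rough userspace analogue is sketched below, assuming PCOMMIT is reported in CPUID leaf 7 (subleaf 0) EBX bit 22; treat the bit position as an assumption from my reading of the docs.

```c
#include <cpuid.h>
#include <stdbool.h>

/*
 * Userspace analogue of boot_cpu_has(X86_FEATURE_PCOMMIT).
 * Probes CPUID.(EAX=7,ECX=0):EBX[22] -- bit position assumed.
 */
static bool cpu_has_pcommit(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return ebx & (1u << 22);
}
```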
On Fri, May 29, 2015 at 11:32 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> On Fri, 2015-05-29 at 11:19 -0700, Dan Williams wrote:
[...]
>> Yes, but if the caller was expecting one memory type and gets another,
>> that is something I think the driver would want to know. At a minimum,
>> I don't think we want to get emails about pmem driver performance
>> problems when someone's platform is silently degrading WB to UC, for
>> example.
>
> The pmem driver creates an ioremap mapping to an NVDIMM range first.
> So there will be no conflict at that point, unless there is a
> conflicting driver claiming the same NVDIMM range.

Hmm, I thought it would be WB due to this comment in
is_new_memtype_allowed():

	/*
	 * PAT type is always WB for untracked ranges, so no need to check.
	 */
On Fri, 2015-05-29 at 12:34 -0700, Dan Williams wrote:
> On Fri, May 29, 2015 at 11:32 AM, Toshi Kani <toshi.kani@hp.com> wrote:
[...]
> > The pmem driver creates an ioremap mapping to an NVDIMM range first.
> > So there will be no conflict at that point, unless there is a
> > conflicting driver claiming the same NVDIMM range.
>
> Hmm, I thought it would be WB due to this comment in
> is_new_memtype_allowed():
>
>	/*
>	 * PAT type is always WB for untracked ranges, so no need to check.
>	 */

This comment applies to the ISA range, for which ioremap() does not
create any mapping, i.e. it is untracked. You can ignore this comment
for NVDIMM.

Thanks,
-Toshi
> -----Original Message-----
> From: Andy Lutomirski [mailto:luto@amacapital.net]
> Sent: Friday, May 29, 2015 1:35 PM
...
> Whoa, there! Why would we use non-temporal stores to WB memory to
> access persistent memory? I can see two reasons not to:

Data written to a block storage device (here, the NVDIMM) is unlikely
to be read or written again any time soon. It's not like the code and
data that a program has in memory, where there might be a loop
accessing the location every CPU clock; it's storage I/O to
historically very slow (relative to the CPU clock speed) devices. The
source buffer for that data might be frequently accessed, but not the
NVDIMM storage itself.

Non-temporal stores avoid wasting cache space on these "one-time"
accesses. The same applies for reads and non-temporal loads. Keep the
CPU data cache lines free for the application.

DAX and mmap() do change that; the application is now free to store
frequently accessed data structures directly in persistent memory.
But that's not available if btt is used, and application loads and
stores won't go through the memcpy() calls inside pmem anyway. The
non-temporal instructions are cache coherent, so data integrity won't
get confused by them if I/O going through pmem's block storage APIs
happens to overlap with the application's mmap() regions.

---
Robert Elliott, HP Server Storage
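To make the above concrete, a rough userspace sketch of such a non-temporal copy using SSE2 intrinsics. This is illustrative only, not the pmem driver's actual copy helper, and it assumes a 16-byte-aligned destination and a length that is a multiple of 16.

```c
#include <stddef.h>
#include <emmintrin.h>	/* SSE2: _mm_stream_si128, _mm_loadu_si128 */

/*
 * Non-temporal copy: the streaming stores bypass the cache, so
 * one-time block I/O does not evict the application's working set.
 */
static void memcpy_nt(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;

	for (size_t i = 0; i < len / 16; i++)
		_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));

	_mm_sfence();	/* make the streaming stores globally ordered */
}
```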
On Fri, May 29, 2015 at 2:29 PM, Elliott, Robert (Server Storage)
<Elliott@hp.com> wrote:
> Data written to a block storage device (here, the NVDIMM) is unlikely
> to be read or written again any time soon.
[...]
> Non-temporal stores avoid wasting cache space on these "one-time"
> accesses. The same applies for reads and non-temporal loads. Keep the
> CPU data cache lines free for the application.

You answered the wrong question. :) I understand the point of the
non-temporal stores -- I don't understand the point of using
non-temporal stores to *WB memory*. I think we should be okay with
having the kernel mapping use WT instead.

--Andy
> -----Original Message-----
> From: Andy Lutomirski [mailto:luto@amacapital.net]
> Sent: Friday, May 29, 2015 4:46 PM
> Subject: Re: [PATCH v10 12/12] drivers/block/pmem: Map NVDIMM with
> ioremap_wt()
>
> On Fri, May 29, 2015 at 2:29 PM, Elliott, Robert (Server Storage)
> <Elliott@hp.com> wrote:
[...]
> You answered the wrong question. :) I understand the point of the
> non-temporal stores -- I don't understand the point of using
> non-temporal stores to *WB memory*. I think we should be okay with
> having the kernel mapping use WT instead.

The cache type that the application chooses for its mmap() view has to
be compatible with the one already selected by the kernel, or we run
into:

Intel SDM 11.12.4 Programming the PAT
...
"The PAT allows any memory type to be specified in the page tables, and
therefore it is possible to have a single physical page mapped to two
or more different linear addresses, each with different memory types.
Intel does not support this practice because it may lead to undefined
operations that can result in a system failure. In particular, a WC
page must never be aliased to a cacheable page because WC writes may
not check the processor caches."

Right now, application memory is always WB, so WB is the only safe
choice from this perspective (the system must have ADR for safety from
other perspectives). That might not be the best choice for all
applications, though; some applications might not want CPU caching of
all the data they run through here and would prefer WC. On a non-ADR
system, WT might be the only safe choice. Should there be a way for
the application to specify a cache type in its mmap() call? The type
already selected by the kernel driver could (carefully) be changed on
the fly if it's different.

Non-temporal store performance is excellent under WB, WC, and WT; if
anything, I think WC edges ahead because it need not snoop the cache.
It's still poor under UC.

---
Robert Elliott, HP Server Storage
Non-temporal stores to WB memory are fine in the sense that they don't
pollute the cache. This can be done by demoting to WC or by forcing
cache allocation out of only a subset of the cache.

On May 29, 2015 2:46:19 PM PDT, Andy Lutomirski <luto@amacapital.net> wrote:
>On Fri, May 29, 2015 at 2:29 PM, Elliott, Robert (Server Storage)
><Elliott@hp.com> wrote:
[...]
>
>You answered the wrong question. :) I understand the point of the
>non-temporal stores -- I don't understand the point of using
>non-temporal stores to *WB memory*. I think we should be okay with
>having the kernel mapping use WT instead.
>
>--Andy
* Andy Lutomirski <luto@amacapital.net> wrote:

> You answered the wrong question. :) I understand the point of the
> non-temporal stores -- I don't understand the point of using
> non-temporal stores to *WB memory*. I think we should be okay with
> having the kernel mapping use WT instead.

WT memory is write-through, but it is still fully cached for reads.

So non-temporal instructions influence how the CPU will allocate (or
not allocate) WT cache lines.

Thanks,

	Ingo
On Mon, Jun 1, 2015 at 1:58 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
[...]
>
> WT memory is write-through, but it is still fully cached for reads.
>
> So non-temporal instructions influence how the CPU will allocate (or
> not allocate) WT cache lines.

I'm doing a terrible job of saying what I mean. Given that we're
using non-temporal writes, the kernel code should work correctly and
with similar performance regardless of whether the mapping is WB or
WT. It would still be correct, if slower, with WC or UC, and, if we
used explicit streaming reads, even that would matter less.

I think this means that we are free to switch the kernel mapping
between WB and WT as needed to improve DAX behavior. We could even
plausibly do it at runtime.

--Andy
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index eabf4a8..095dfaa 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -139,11 +139,11 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
 	}
 
 	/*
-	 * Map the memory as non-cachable, as we can't write back the contents
+	 * Map the memory as write-through, as we can't write back the contents
 	 * of the CPU caches in case of a crash.
 	 */
 	err = -ENOMEM;
-	pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+	pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
 	if (!pmem->virt_addr)
 		goto out_release_region;
The pmem driver maps NVDIMM with ioremap_nocache() as we cannot
write back the contents of the CPU caches in case of a crash.

This patch changes to use ioremap_wt(), which provides uncached
writes but cached reads, for improving read performance.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
---
 drivers/block/pmem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)