Message ID: CAPcyv4he0q_FdqqiXarp0bXjcggs8QZX8Od560E2iFxzCU3Qag@mail.gmail.com (mailing list archive)
State: New, archived
Series: [GIT,PULL] device-dax for 5.1: PMEM as RAM
On Sun, Mar 10, 2019 at 12:54 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Hi Linus, please pull from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
>   tags/devdax-for-5.1
>
> ...to receive new device-dax infrastructure to allow persistent memory
> and other "reserved" / performance differentiated memories, to be
> assigned to the core-mm as "System RAM".

I'm not pulling this until I get official Intel clarification on the
whole "pmem vs rep movs vs machine check" behavior.

Last I saw it was deadly and didn't work, and we have a whole "mc-safe
memory copy" thing for it in the kernel because repeat string
instructions didn't work correctly on nvmem.

No way am I exposing any users to something like that.

We need a way to know when it works and when it doesn't, and only do it
when it's safe.

                Linus
[ add Tony, who has wrestled with how to detect rep; movs recoverability ]

On Sun, Mar 10, 2019 at 1:02 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> I'm not pulling this until I get official Intel clarification on the
> whole "pmem vs rep movs vs machine check" behavior.
>
> Last I saw it was deadly and didn't work, and we have a whole "mc-safe
> memory copy" thing for it in the kernel because repeat string
> instructions didn't work correctly on nvmem.
>
> No way am I exposing any users to something like that.
>
> We need a way to know when it works and when it doesn't, and only do
> it when it's safe.

Unfortunately this particular b0rkage is not constrained to nvmem. I.e.
there's nothing specific about nvmem that requires the mc-safe memory
copy; it's a cpu problem consuming any poison, regardless of
source-media-type, with "rep; movs".
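For context, the "mc-safe memory copy" under discussion is the kernel's
memcpy_mcsafe(). A condensed sketch of its x86 wrapper as of roughly the
v5.0 kernel (simplified here from arch/x86/include/asm/string_64.h, not
the verbatim source):

/* Returns 0 on success; on a machine check mid-copy it returns the
 * number of bytes left uncopied, so the caller can fail gracefully
 * instead of crashing the box. */
static __always_inline __must_check unsigned long
memcpy_mcsafe(void *dst, const void *src, size_t cnt)
{
#ifdef CONFIG_X86_MCE
	/* __memcpy_mcsafe() is the exception-table-annotated variant;
	 * the mcsafe_key static branch is flipped on when the cpu
	 * needs/supports recoverable copies. */
	if (static_branch_unlikely(&mcsafe_key))
		return __memcpy_mcsafe(dst, src, cnt);
	else
#endif
		memcpy(dst, src, cnt);
	return 0;
}

Notably, __memcpy_mcsafe() is written with unrolled mov instructions and
exception-table entries precisely because a poison hit under "rep; movs"
was not recoverable, which is the gap this thread is circling.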
On Sun, Mar 10, 2019 at 4:54 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Unfortunately this particular b0rkage is not constrained to nvmem.
> I.e. there's nothing specific about nvmem that requires the mc-safe
> memory copy; it's a cpu problem consuming any poison, regardless of
> source-media-type, with "rep; movs".

So why is it sold and used for the nvdimm pmem driver?

People told me it was a big deal and machines died.

You can't suddenly change the story just because you want to expose it
to user space.

You can't have it both ways. Either nvdimms have more likelihood of,
and problems with, machine checks, or they don't.

The end result is the same: if Intel believes the kernel needs to treat
nvdimms specially, then we're sure as hell not exposing those
snowflakes to user space.

And if Intel *doesn't* believe that, then we're removing the mcsafe_*
functions.

There's no "oh, it's safe to show to user space, but the kernel is
magical" middle ground here that makes sense to me.

                Linus
On Sun, Mar 10, 2019 at 5:22 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> So why is it sold and used for the nvdimm pmem driver?
>
> People told me it was a big deal and machines died.
>
> [...]
>
> There's no "oh, it's safe to show to user space, but the kernel is
> magical" middle ground here that makes sense to me.

I don't think anyone is trying to claim both ways... the mcsafe memcpy
is not implemented because NVDIMMs have a higher chance of encountering
poison; it's implemented because the pmem driver affords an error model
that just isn't possible in other kernel poison-consumption paths. Even
if this issue didn't exist there would still be a rep; mov based mcsafe
memcpy for the driver to use, on the expectation that userspace would
prefer EIO to a reboot for kernel-space consumed poison.

That said, I agree with the argument that a kernel mcsafe copy is not
sufficient when DAX is there to arrange for the bulk of
memory-mapped-I/O to be issued from userspace.

Another feature the userspace tooling can support for the PMEM-as-RAM
case is the ability to complete an Address Range Scrub of the range
before it is added to the core-mm, i.e. at least ensure that previously
encountered poison is eliminated. The driver can also publish an
attribute to indicate when rep; mov is recoverable, and gate the
hotplug policy on the result. In my opinion a positive indicator of the
cpu's ability to recover rep; mov exceptions is a gap that needs
addressing.
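The error model Dan describes, returning EIO to the caller rather than
taking the machine down, is visible in the pmem driver's read path. A
simplified sketch modeled on read_pmem() in drivers/nvdimm/pmem.c (the
real function loops over pages; this is illustrative, not verbatim):

/* Simplified sketch of the pmem driver's poison-aware read path,
 * modeled on read_pmem() in drivers/nvdimm/pmem.c. A non-zero return
 * from memcpy_mcsafe() means the copy tripped over poison; the driver
 * converts that into a block-layer error, which userspace ultimately
 * sees as EIO, rather than letting the machine check be fatal. */
static blk_status_t read_pmem(struct page *page, unsigned int off,
		void *pmem_addr, unsigned int len)
{
	void *mem = kmap_atomic(page);
	unsigned long rem = memcpy_mcsafe(mem + off, pmem_addr, len);

	kunmap_atomic(mem);
	return rem ? BLK_STS_IOERR : BLK_STS_OK;
}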
On Mon, Mar 11, 2019 at 8:37 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Another feature the userspace tooling can support for the PMEM-as-RAM
> case is the ability to complete an Address Range Scrub of the range
> before it is added to the core-mm, i.e. at least ensure that
> previously encountered poison is eliminated.

Ok, so this at least makes sense as an argument to me.

In the "PMEM as filesystem" part, the errors have long-term history,
while in "PMEM as RAM" the memory may be physically the same thing, but
it doesn't have the history and as such may not be prone to long-term
errors the same way.

So that validly argues that yes, when used as RAM, the likelihood for
errors is much lower, because they don't accumulate the same way.

> The driver can also publish an attribute to indicate when rep; mov is
> recoverable, and gate the hotplug policy on the result. In my opinion
> a positive indicator of the cpu's ability to recover rep; mov
> exceptions is a gap that needs addressing.

Is there some way to say "don't raise MC for this region"? Or at least
limit it to a nonfatal one?

                Linus
On Mon, Mar 11, 2019 at 5:08 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Is there some way to say "don't raise MC for this region"? Or at
> least limit it to a nonfatal one?

I wish, but no. Poison consumption always raises the MC; it's then
whether MCI_STATUS_PCC (processor context corrupt) is set that
determines whether the cpu indicates it is safe to proceed. There's no
way to request "never set MCI_STATUS_PCC", or to silence the exception.
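For readers not steeped in the machine-check architecture: a hedged
sketch of the decision Dan describes, pivoting on status bits in the
IA32_MCi_STATUS register (illustrative only; the kernel's actual
grading lives in its x86 mce severity code, and the helper name here is
hypothetical):

#include <asm/mce.h>	/* struct mce, MCI_STATUS_* definitions */

/* Hypothetical helper, not kernel code: if PCC is set the cpu is
 * declaring its context corrupt and execution must not continue, no
 * matter what software would prefer. Absent PCC, recovery is plausible
 * when the hardware recorded a valid error address. */
static bool poison_consumption_recoverable(const struct mce *m)
{
	if (m->status & MCI_STATUS_PCC)
		return false;	/* processor context corrupt: fatal */

	return m->status & MCI_STATUS_ADDRV;	/* error address valid */
}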
On Mon, Mar 11, 2019 at 5:08 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Ok, so this at least makes sense as an argument to me.
>
> [...]
>
> So that validly argues that yes, when used as RAM, the likelihood for
> errors is much lower, because they don't accumulate the same way.

Hi Linus,

The question about a new enumeration mechanism for this has been
raised, but I don't expect a response before the merge window closes.
While it percolates, how do you want to proceed in the meantime? The
kernel could export its knowledge of the situation in
/sys/devices/system/cpu/vulnerabilities?

Otherwise, the exposure can be reduced in the volatile-RAM case by
scanning for, and clearing, errors before the range is onlined as RAM.
The userspace tooling for that can be in place before v5.1-final.
There's also runtime notification of errors via
acpi_nfit_uc_error_notify() from background scrubbers on the DIMM
devices. With that mechanism the kernel could proactively clear newly
discovered poison in the volatile case, but that would be additional
development more suitable for v5.2.

I understand the concern, and the need to highlight this issue by
tapping the brakes on feature development, but I don't see PMEM as RAM
making the situation worse when the exposure is also there via DAX in
the PMEM case. Volatile-RAM is arguably a safer use case since it's
possible to repair pages, where the persistent case needs active
application coordination.

Please take another look at merging this for v5.1, or otherwise let me
know what software changes you'd like to see to move this forward. I'm
also open to the idea of just teaching memcpy_mcsafe() to use rep; mov
as if it was always recoverable, and relying on the error being mapped
out after reboot if it was not. At reboot the driver gets notification
of physical addresses that caused a previous crash, so that software
can avoid future consumption.

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
  tags/devdax-for-5.1
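Dan's "scan for and clear errors before onlining" flow can be gated
from userspace. A hedged sketch under an assumed sysfs layout: the
regionX/badblocks attribute path and the region name below are
assumptions for illustration, and real tooling (ndctl/daxctl) walks the
device hierarchy properly rather than hard-coding paths:

#include <stdio.h>

/* Assumed attribute: /sys/bus/nd/devices/regionX/badblocks lists
 * "offset count" pairs for known-poisoned 512-byte sectors; an empty
 * file means no known poison. The path is an assumption for this
 * sketch. */
static int region_has_badblocks(const char *region)
{
	char path[256], line[128];
	FILE *f;
	int bad;

	snprintf(path, sizeof(path),
		 "/sys/bus/nd/devices/%s/badblocks", region);
	f = fopen(path, "r");
	if (!f)
		return -1;	/* unknown: caller should stay cautious */
	bad = (fgets(line, sizeof(line), f) != NULL);
	fclose(f);
	return bad;
}

int main(void)
{
	int bad = region_has_badblocks("region0"); /* hypothetical name */

	if (bad < 0)
		fprintf(stderr, "cannot read badblocks; do not online\n");
	else if (bad)
		fprintf(stderr, "poison present; scrub/clear first\n");
	else
		printf("no known poison; ok to online as System RAM\n");
	return bad != 0;
}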
The pull request you sent on Sun, 10 Mar 2019 12:54:01 -0700:
> git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/devdax-for-5.1
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/f67e3fb4891287b8248ebb3320f794b9f5e782d4
Thank you!
On Mon, Mar 11, 2019 at 5:08 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Ok, so this at least makes sense as an argument to me.
>
> [...]
>
> So that validly argues that yes, when used as RAM, the likelihood for
> errors is much lower, because they don't accumulate the same way.

In case anyone is looking for the above-mentioned tooling for use with
the v5.1 kernel, Vishal has released ndctl-v65 with the new
"clear-errors" command [1].

[1]: https://pmem.io/ndctl/ndctl-clear-errors.html