Message ID | 1451388711-18646-5-git-send-email-haozhong.zhang@intel.com (mailing list archive)
State | New, archived
>>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote: > NVDIMM devices are detected and configured by software through > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This > patch extends the existing mechanism in hvmloader of loading passthrough > ACPI tables to load extra ACPI tables built by QEMU. Mechanically the patch looks okay, but whether it's actually needed depends on whether indeed we want NV RAM managed in qemu instead of in the hypervisor (where imo it belongs); I didn't see any reply yet to that same comment of mine made (iirc) in the context of another patch. Jan
On 01/15/16 10:10, Jan Beulich wrote: > >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote: > > NVDIMM devices are detected and configured by software through > > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This > > patch extends the existing mechanism in hvmloader of loading passthrough > > ACPI tables to load extra ACPI tables built by QEMU. > > Mechanically the patch looks okay, but whether it's actually needed > > depends on whether indeed we want NV RAM managed in qemu > > instead of in the hypervisor (where imo it belongs); I didn't see any > > reply yet to that same comment of mine made (iirc) in the context > > of another patch. > > Jan > One purpose of this patch series is to provide vNVDIMM backed by host NVDIMM devices. It requires some drivers to detect and manage host NVDIMM devices (including parsing ACPI, managing labels, etc.) that are not trivial, so I leave this work to the dom0 Linux kernel. The current Linux kernel abstracts NVDIMM devices as block devices (/dev/pmemXX). QEMU then mmaps them into a certain range of dom0's address space and asks the Xen hypervisor to map that range of address space to a domU. However, there are two problems in this Xen patch series and the corresponding QEMU patch series, which may require further changes in the hypervisor and/or toolstack. (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map the host NVDIMM to domU, which results in a VMEXIT for every guest read/write to the corresponding vNVDIMM devices. I'm going to find a way to pass through the address space range of the host NVDIMM to a guest domU (similar to what xen-pt in QEMU does). (2) Xen currently does not check whether the address that QEMU asks to map to domU is really within the host NVDIMM address space. Therefore, the Xen hypervisor needs a way to determine the host NVDIMM address space, which can be done by parsing ACPI NFIT tables. Haozhong
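To make problem (1) above concrete, below is a minimal sketch of the libxc call the current QEMU approach boils down to; everything except xc_hvm_map_io_range_to_ioreq_server() itself (the wrapper name and the gpa_start/gpa_end parameters) is an illustrative assumption. Registering the range with the ioreq server only routes guest accesses to the emulator, which is why every vNVDIMM read/write traps:

    #include <stdint.h>
    #include <xenctrl.h>

    /* Sketch only: ask Xen to forward guest accesses in the given
     * guest-physical range to this ioreq server (i.e. to QEMU).  Every
     * such access causes a VMEXIT and an emulation round trip. */
    static int trap_vnvdimm_range(xc_interface *xch, domid_t domid,
                                  ioservid_t ioservid,
                                  uint64_t gpa_start, uint64_t gpa_end)
    {
        return xc_hvm_map_io_range_to_ioreq_server(xch, domid, ioservid,
                                                   1 /* is_mmio */,
                                                   gpa_start, gpa_end);
    }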
>>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote: > On 01/15/16 10:10, Jan Beulich wrote: >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote: >> > NVDIMM devices are detected and configured by software through >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This >> > patch extends the existing mechanism in hvmloader of loading passthrough >> > ACPI tables to load extra ACPI tables built by QEMU. >> >> Mechanically the patch looks okay, but whether it's actually needed >> depends on whether indeed we want NV RAM managed in qemu >> instead of in the hypervisor (where imo it belongs); I didn' see any >> reply yet to that same comment of mine made (iirc) in the context >> of another patch. > > One purpose of this patch series is to provide vNVDIMM backed by host > NVDIMM devices. It requires some drivers to detect and manage host > NVDIMM devices (including parsing ACPI, managing labels, etc.) that > are not trivial, so I leave this work to the dom0 linux. Current Linux > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU > then mmaps them into certain range of dom0's address space and asks > Xen hypervisor to map that range of address space to a domU. > > However, there are two problems in this Xen patch series and the > corresponding QEMU patch series, which may require further > changes in hypervisor and/or toolstack. > > (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map > the host NVDIMM to domU, which results VMEXIT for every guest > read/write to the corresponding vNVDIMM devices. I'm going to find > a way to passthrough the address space range of host NVDIMM to a > guest domU (similarly to what xen-pt in QEMU uses) > > (2) Xen currently does not check whether the address that QEMU asks to > map to domU is really within the host NVDIMM address > space. Therefore, Xen hypervisor needs a way to decide the host > NVDIMM address space which can be done by parsing ACPI NFIT > tables. These problems are a pretty direct result of the management of NVDIMM not being done by the hypervisor. Stating what qemu currently does is, I'm afraid, not really serving the purpose of hashing out whether the management of NVDIMM, just like that of "normal" RAM, wouldn't better be done by the hypervisor. In fact so far I haven't seen any rationale (other than the desire to share code with KVM) for the presently chosen solution. Yet in KVM qemu is - afaict - much more of an integral part of the hypervisor than it is in the Xen case (and even there core management of the memory is left to the kernel, i.e. what constitutes the core hypervisor there). Jan
On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote: > >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote: > > On 01/15/16 10:10, Jan Beulich wrote: > >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote: > >> > NVDIMM devices are detected and configured by software through > >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This > >> > patch extends the existing mechanism in hvmloader of loading passthrough > >> > ACPI tables to load extra ACPI tables built by QEMU. > >> > >> Mechanically the patch looks okay, but whether it's actually needed > >> depends on whether indeed we want NV RAM managed in qemu > >> instead of in the hypervisor (where imo it belongs); I didn' see any > >> reply yet to that same comment of mine made (iirc) in the context > >> of another patch. > > > > One purpose of this patch series is to provide vNVDIMM backed by host > > NVDIMM devices. It requires some drivers to detect and manage host > > NVDIMM devices (including parsing ACPI, managing labels, etc.) that > > are not trivial, so I leave this work to the dom0 linux. Current Linux > > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU > > then mmaps them into certain range of dom0's address space and asks > > Xen hypervisor to map that range of address space to a domU. > > OOI Do we have a viable solution to do all these non-trivial things in core hypervisor? Are you proposing designing a new set of hypercalls for NVDIMM? Wei.
>>> On 19.01.16 at 12:37, <wei.liu2@citrix.com> wrote: > On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote: >> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote: >> > On 01/15/16 10:10, Jan Beulich wrote: >> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote: >> >> > NVDIMM devices are detected and configured by software through >> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This >> >> > patch extends the existing mechanism in hvmloader of loading passthrough >> >> > ACPI tables to load extra ACPI tables built by QEMU. >> >> >> >> Mechanically the patch looks okay, but whether it's actually needed >> >> depends on whether indeed we want NV RAM managed in qemu >> >> instead of in the hypervisor (where imo it belongs); I didn' see any >> >> reply yet to that same comment of mine made (iirc) in the context >> >> of another patch. >> > >> > One purpose of this patch series is to provide vNVDIMM backed by host >> > NVDIMM devices. It requires some drivers to detect and manage host >> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that >> > are not trivial, so I leave this work to the dom0 linux. Current Linux >> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU >> > then mmaps them into certain range of dom0's address space and asks >> > Xen hypervisor to map that range of address space to a domU. >> > > > OOI Do we have a viable solution to do all these non-trivial things in > core hypervisor? Are you proposing designing a new set of hypercalls > for NVDIMM? That's certainly a possibility; I lack sufficient detail to make myself an opinion which route is going to be best. Jan
> From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: Tuesday, January 19, 2016 7:47 PM > > >>> On 19.01.16 at 12:37, <wei.liu2@citrix.com> wrote: > > On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote: > >> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote: > >> > On 01/15/16 10:10, Jan Beulich wrote: > >> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote: > >> >> > NVDIMM devices are detected and configured by software through > >> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This > >> >> > patch extends the existing mechanism in hvmloader of loading passthrough > >> >> > ACPI tables to load extra ACPI tables built by QEMU. > >> >> > >> >> Mechanically the patch looks okay, but whether it's actually needed > >> >> depends on whether indeed we want NV RAM managed in qemu > >> >> instead of in the hypervisor (where imo it belongs); I didn' see any > >> >> reply yet to that same comment of mine made (iirc) in the context > >> >> of another patch. > >> > > >> > One purpose of this patch series is to provide vNVDIMM backed by host > >> > NVDIMM devices. It requires some drivers to detect and manage host > >> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that > >> > are not trivial, so I leave this work to the dom0 linux. Current Linux > >> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU > >> > then mmaps them into certain range of dom0's address space and asks > >> > Xen hypervisor to map that range of address space to a domU. > >> > > > > > OOI Do we have a viable solution to do all these non-trivial things in > > core hypervisor? Are you proposing designing a new set of hypercalls > > for NVDIMM? > > That's certainly a possibility; I lack sufficient detail to make myself > an opinion which route is going to be best. > > Jan Hi, Haozhong, Are NVDIMM related ACPI table in plain text format, or do they require a ACPI parser to decode? Is there a corresponding E820 entry? Above information would be useful to help decide the direction. In a glimpse I like Jan's idea that it's better to let Xen manage NVDIMM since it's a type of memory resource while for memory we expect hypervisor to centrally manage. However in another thought the answer is different if we view this resource as a MMIO resource, similar to PCI BAR MMIO, ACPI NVS, etc. then it should be fine to have Dom0 manage NVDIMM then Xen just controls the mapping based on existing io permission mechanism. Another possible point for this model is that PMEM is only one mode of NVDIMM device, which can be also exposed as a storage device. In the latter case the management has to be in Dom0. So we don't need to scatter the management role into Dom0/Xen based on different modes. Back to your earlier questions: > (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map > the host NVDIMM to domU, which results VMEXIT for every guest > read/write to the corresponding vNVDIMM devices. I'm going to find > a way to passthrough the address space range of host NVDIMM to a > guest domU (similarly to what xen-pt in QEMU uses) > > (2) Xen currently does not check whether the address that QEMU asks to > map to domU is really within the host NVDIMM address > space. Therefore, Xen hypervisor needs a way to decide the host > NVDIMM address space which can be done by parsing ACPI NFIT > tables. 
If you look at how ACPI OpRegion is handled for IGD passthrough:

    ret = xc_domain_iomem_permission(xen_xc, xen_domid,
            (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
            XEN_PCI_INTEL_OPREGION_PAGES,
            XEN_PCI_INTEL_OPREGION_ENABLE_ACCESSED);

    ret = xc_domain_memory_mapping(xen_xc, xen_domid,
            (unsigned long)(igd_guest_opregion >> XC_PAGE_SHIFT),
            (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
            XEN_PCI_INTEL_OPREGION_PAGES,
            DPCI_ADD_MAPPING);

Above can address your 2 questions. Xen doesn't need to tell exactly whether the assigned range actually belongs to NVDIMM, just like the policy for PCI assignment today.

Thanks
Kevin
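For comparison, a hedged sketch of what reusing that OpRegion pattern for an NVDIMM range could look like; every name below is illustrative, and the final argument 1 corresponds to DPCI_ADD_MAPPING in the snippet above:

    #include <xenctrl.h>

    /* Sketch only: grant the domain access to the host machine frames
     * backing the NVDIMM region, then establish the p2m mapping at the
     * chosen guest frame, mirroring the IGD OpRegion code above. */
    static int assign_nvdimm_range(xc_interface *xch, uint32_t domid,
                                   unsigned long guest_gfn,
                                   unsigned long host_mfn,
                                   unsigned long nr_pages)
    {
        int rc = xc_domain_iomem_permission(xch, domid, host_mfn,
                                            nr_pages, 1 /* allow */);
        if (rc)
            return rc;
        return xc_domain_memory_mapping(xch, domid, guest_gfn, host_mfn,
                                        nr_pages, 1 /* DPCI_ADD_MAPPING */);
    }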
Hi Jan, Wei and Kevin, On 01/18/16 01:46, Jan Beulich wrote: > >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote: > > On 01/15/16 10:10, Jan Beulich wrote: > >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote: > >> > NVDIMM devices are detected and configured by software through > >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This > >> > patch extends the existing mechanism in hvmloader of loading passthrough > >> > ACPI tables to load extra ACPI tables built by QEMU. > >> > >> Mechanically the patch looks okay, but whether it's actually needed > >> depends on whether indeed we want NV RAM managed in qemu > >> instead of in the hypervisor (where imo it belongs); I didn' see any > >> reply yet to that same comment of mine made (iirc) in the context > >> of another patch. > > > > One purpose of this patch series is to provide vNVDIMM backed by host > > NVDIMM devices. It requires some drivers to detect and manage host > > NVDIMM devices (including parsing ACPI, managing labels, etc.) that > > are not trivial, so I leave this work to the dom0 linux. Current Linux > > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU > > then mmaps them into certain range of dom0's address space and asks > > Xen hypervisor to map that range of address space to a domU. > > > > However, there are two problems in this Xen patch series and the > > corresponding QEMU patch series, which may require further > > changes in hypervisor and/or toolstack. > > > > (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map > > the host NVDIMM to domU, which results VMEXIT for every guest > > read/write to the corresponding vNVDIMM devices. I'm going to find > > a way to passthrough the address space range of host NVDIMM to a > > guest domU (similarly to what xen-pt in QEMU uses) > > > > (2) Xen currently does not check whether the address that QEMU asks to > > map to domU is really within the host NVDIMM address > > space. Therefore, Xen hypervisor needs a way to decide the host > > NVDIMM address space which can be done by parsing ACPI NFIT > > tables. > > These problems are a pretty direct result of the management of > NVDIMM not being done by the hypervisor. > > Stating what qemu currently does is, I'm afraid, not really serving > the purpose of hashing out whether the management of NVDIMM, > just like that of "normal" RAM, wouldn't better be done by the > hypervisor. In fact so far I haven't seen any rationale (other than > the desire to share code with KVM) for the presently chosen > solution. Yet in KVM qemu is - afaict - much more of an integral part > of the hypervisor than it is in the Xen case (and even there core > management of the memory is left to the kernel, i.e. what > constitutes the core hypervisor there). > > Jan > Sorry for the later reply, as I was reading some code and trying to get things clear for myself. The primary reason of current solution is to reuse existing NVDIMM driver in Linux kernel. One responsibility of this driver is to discover NVDIMM devices and their parameters (e.g. which portion of an NVDIMM device can be mapped into the system address space and which address it is mapped to) by parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of ACPI Specification v6 and the actual code in Linux kernel (drivers/acpi/nfit.*), it's not a trivial task. Secondly, the driver implements a convenient block device interface to let software access areas where NVDIMM devices are mapped. 
The existing vNVDIMM implementation in QEMU uses this interface. As Linux NVDIMM driver has already done above, why do we bother to reimplement them in Xen? For the two problems raised in my previous reply, following are my thoughts. (1) (for the first problem) QEMU mmaps /dev/pmemXX into its virtual address space. When it works with KVM, it calls KVM api to map that virtual address space range into a guest physical address space. For Xen, I'm going to do the similar thing, but Xen seems not provide such api. The most close one I can find is XEN_DOMCTL_memory_mapping (which is used by VGA passthrough in QEMU xen_pt_graphics), but it does not accept guest virtual address. Thus, I'm going to add a new one that does similar work but can accept guest virtual address. (2) (for the second problem) After having looked at the corresponding Linux kernel code and my comments at beginning, I now doubt if it's necessary to parsing NFIT in Xen. Maybe I can follow what xen_pt_graphics does, that is to assign guest with permission to access the corresponding host NVDIMM address space range and then call the new hypercall added in (1). Again, a new hypercall that is similar to XEN_DOMCTL_iomem_permission and can accept guest virtual address is needed. Any comments? Thanks, Haozhong
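Purely to make that proposal concrete, a hypothetical declaration for such an interface is sketched below; no libxc call with this name or signature exists, and it is shown only to illustrate the shape of the call being discussed (it would take the caller's virtual address, e.g. QEMU's mmap of /dev/pmemXX, instead of a machine frame number):

    #include <xenctrl.h>

    /* HYPOTHETICAL, for discussion only: map nr_pages starting at the
     * calling domain's virtual address va (e.g. QEMU's mmap of
     * /dev/pmemXX) into domain domid at guest frame number gfn.  The
     * name and signature are invented; nothing like this exists in
     * libxc today. */
    int xc_domain_map_vaddr_to_gfn(xc_interface *xch, uint32_t domid,
                                   void *va, unsigned long nr_pages,
                                   unsigned long gfn);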
On 01/20/16 13:14, Tian, Kevin wrote: > > From: Jan Beulich [mailto:JBeulich@suse.com] > > Sent: Tuesday, January 19, 2016 7:47 PM > > > > >>> On 19.01.16 at 12:37, <wei.liu2@citrix.com> wrote: > > > On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote: > > >> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote: > > >> > On 01/15/16 10:10, Jan Beulich wrote: > > >> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote: > > >> >> > NVDIMM devices are detected and configured by software through > > >> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This > > >> >> > patch extends the existing mechanism in hvmloader of loading passthrough > > >> >> > ACPI tables to load extra ACPI tables built by QEMU. > > >> >> > > >> >> Mechanically the patch looks okay, but whether it's actually needed > > >> >> depends on whether indeed we want NV RAM managed in qemu > > >> >> instead of in the hypervisor (where imo it belongs); I didn' see any > > >> >> reply yet to that same comment of mine made (iirc) in the context > > >> >> of another patch. > > >> > > > >> > One purpose of this patch series is to provide vNVDIMM backed by host > > >> > NVDIMM devices. It requires some drivers to detect and manage host > > >> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that > > >> > are not trivial, so I leave this work to the dom0 linux. Current Linux > > >> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU > > >> > then mmaps them into certain range of dom0's address space and asks > > >> > Xen hypervisor to map that range of address space to a domU. > > >> > > > > > > > OOI Do we have a viable solution to do all these non-trivial things in > > > core hypervisor? Are you proposing designing a new set of hypercalls > > > for NVDIMM? > > > > That's certainly a possibility; I lack sufficient detail to make myself > > an opinion which route is going to be best. > > > > Jan > > Hi, Haozhong, > > Are NVDIMM related ACPI table in plain text format, or do they require > a ACPI parser to decode? Is there a corresponding E820 entry? > Most in plain text format, but still the driver evaluates _FIT (firmware interface table) method and decode is needed then. > Above information would be useful to help decide the direction. > > In a glimpse I like Jan's idea that it's better to let Xen manage NVDIMM > since it's a type of memory resource while for memory we expect hypervisor > to centrally manage. > > However in another thought the answer is different if we view this > resource as a MMIO resource, similar to PCI BAR MMIO, ACPI NVS, etc. > then it should be fine to have Dom0 manage NVDIMM then Xen just controls > the mapping based on existing io permission mechanism. > It's more like a MMIO device than the normal ram. > Another possible point for this model is that PMEM is only one mode of > NVDIMM device, which can be also exposed as a storage device. In the > latter case the management has to be in Dom0. So we don't need to > scatter the management role into Dom0/Xen based on different modes. > NVDIMM device in pmem mode is exposed as storage device (a block device /dev/pmemXX) in Linux, and it's also used like a disk drive (you can make file system on it, create files on it and even pass files rather than a whole /dev/pmemXX to guests). 
> Back to your earlier questions: > > > (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map > > the host NVDIMM to domU, which results VMEXIT for every guest > > read/write to the corresponding vNVDIMM devices. I'm going to find > > a way to passthrough the address space range of host NVDIMM to a > > guest domU (similarly to what xen-pt in QEMU uses) > > > > (2) Xen currently does not check whether the address that QEMU asks to > > map to domU is really within the host NVDIMM address > > space. Therefore, Xen hypervisor needs a way to decide the host > > NVDIMM address space which can be done by parsing ACPI NFIT > > tables. > > If you look at how ACPI OpRegion is handled for IGD passthrough: > > 241 ret = xc_domain_iomem_permission(xen_xc, xen_domid, > 242 (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT), > 243 XEN_PCI_INTEL_OPREGION_PAGES, > 244 XEN_PCI_INTEL_OPREGION_ENABLE_ACCESSED); > > 254 ret = xc_domain_memory_mapping(xen_xc, xen_domid, > 255 (unsigned long)(igd_guest_opregion >> XC_PAGE_SHIFT), > 256 (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT), > 257 XEN_PCI_INTEL_OPREGION_PAGES, > 258 DPCI_ADD_MAPPING); > Yes, I've noticed these two functions. The addition work would be adding new ones that can accept virtual address, as QEMU has no easy way to get the physical address of /dev/pmemXX and can only mmap them into its virtual address space. > Above can address your 2 questions. Xen doesn't need to tell exactly > whether the assigned range actually belongs to NVDIMM, just like > the policy for PCI assignment today. > That means Xen hypervisor can trust whatever address dom0 kernel and QEMU provide? Thanks, Haozhong
>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote: > The primary reason of current solution is to reuse existing NVDIMM > driver in Linux kernel. Re-using code in the Dom0 kernel has benefits and drawbacks, and in any event needs to depend on proper layering to remain in place. A benefit is less code duplication between Xen and Linux; along the same lines a drawback is code duplication between various Dom0 OS variants. > One responsibility of this driver is to discover NVDIMM devices and > their parameters (e.g. which portion of an NVDIMM device can be mapped > into the system address space and which address it is mapped to) by > parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of > ACPI Specification v6 and the actual code in Linux kernel > (drivers/acpi/nfit.*), it's not a trivial task. To answer one of Kevin's questions: The NFIT table doesn't appear to require the ACPI interpreter. It seems more like SRAT and SLIT. Also you failed to answer Kevin's question regarding E820 entries: I think NVDIMM (or at least parts thereof) get represented in E820 (or the EFI memory map), and if that's the case this would be a very strong hint towards management needing to be in the hypervisor. > Secondly, the driver implements a convenient block device interface to > let software access areas where NVDIMM devices are mapped. The > existing vNVDIMM implementation in QEMU uses this interface. > > As Linux NVDIMM driver has already done above, why do we bother to > reimplement them in Xen? See above; a possibility is that we may need a split model (block layer parts on Dom0, "normal memory" parts in the hypervisor). Iirc the split is being determined by firmware, and hence set in stone by the time OS (or hypervisor) boot starts. Jan
On 20/01/2016 08:46, Jan Beulich wrote: >>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote: >> The primary reason of current solution is to reuse existing NVDIMM >> driver in Linux kernel. > Re-using code in the Dom0 kernel has benefits and drawbacks, and > in any event needs to depend on proper layering to remain in place. > A benefit is less code duplication between Xen and Linux; along the > same lines a drawback is code duplication between various Dom0 > OS variants. > >> One responsibility of this driver is to discover NVDIMM devices and >> their parameters (e.g. which portion of an NVDIMM device can be mapped >> into the system address space and which address it is mapped to) by >> parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of >> ACPI Specification v6 and the actual code in Linux kernel >> (drivers/acpi/nfit.*), it's not a trivial task. > To answer one of Kevin's questions: The NFIT table doesn't appear > to require the ACPI interpreter. They seem more like SRAT and SLIT. > Also you failed to answer Kevin's question regarding E820 entries: I > think NVDIMM (or at least parts thereof) get represented in E820 (or > the EFI memory map), and if that's the case this would be a very > strong hint towards management needing to be in the hypervisor. Conceptually, an NVDIMM is just like a fast SSD which is linearly mapped into memory. I am still on the dom0 side of this fence. The real question is whether it is possible to take an NVDIMM, split it in half, give each half to two different guests (with appropriate NFIT tables) and that be sufficient for the guests to just work. Either way, it needs to be a toolstack policy decision as to how to split the resource. ~Andrew > >> Secondly, the driver implements a convenient block device interface to >> let software access areas where NVDIMM devices are mapped. The >> existing vNVDIMM implementation in QEMU uses this interface. >> >> As Linux NVDIMM driver has already done above, why do we bother to >> reimplement them in Xen? > See above; a possibility is that we may need a split model (block > layer parts on Dom0, "normal memory" parts in the hypervisor. > Iirc the split is being determined by firmware, and hence set in > stone by the time OS (or hypervisor) boot starts. > > Jan >
On 01/20/16 08:58, Andrew Cooper wrote: > On 20/01/2016 08:46, Jan Beulich wrote: > >>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote: > >> The primary reason of current solution is to reuse existing NVDIMM > >> driver in Linux kernel. > > Re-using code in the Dom0 kernel has benefits and drawbacks, and > > in any event needs to depend on proper layering to remain in place. > > A benefit is less code duplication between Xen and Linux; along the > > same lines a drawback is code duplication between various Dom0 > > OS variants. > > > >> One responsibility of this driver is to discover NVDIMM devices and > >> their parameters (e.g. which portion of an NVDIMM device can be mapped > >> into the system address space and which address it is mapped to) by > >> parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of > >> ACPI Specification v6 and the actual code in Linux kernel > >> (drivers/acpi/nfit.*), it's not a trivial task. > > To answer one of Kevin's questions: The NFIT table doesn't appear > > to require the ACPI interpreter. They seem more like SRAT and SLIT. > > Also you failed to answer Kevin's question regarding E820 entries: I > > think NVDIMM (or at least parts thereof) get represented in E820 (or > > the EFI memory map), and if that's the case this would be a very > > strong hint towards management needing to be in the hypervisor. > CCing QEMU vNVDIMM maintainer: Xiao Guangrong > Conceptually, an NVDIMM is just like a fast SSD which is linearly mapped > into memory. I am still on the dom0 side of this fence. > > The real question is whether it is possible to take an NVDIMM, split it > in half, give each half to two different guests (with appropriate NFIT > tables) and that be sufficient for the guests to just work. > Yes, one NVDIMM device can be split into multiple parts and assigned to different guests, and QEMU is responsible to maintain virtual NFIT tables for each part. > Either way, it needs to be a toolstack policy decision as to how to > split the resource. > But the split does not need to be done at Xen side IMO. It can be done by dom0 kernel and QEMU as long as they tells Xen hypervisor the address space range of each part. Haozhong > ~Andrew > > > > >> Secondly, the driver implements a convenient block device interface to > >> let software access areas where NVDIMM devices are mapped. The > >> existing vNVDIMM implementation in QEMU uses this interface. > >> > >> As Linux NVDIMM driver has already done above, why do we bother to > >> reimplement them in Xen? > > See above; a possibility is that we may need a split model (block > > layer parts on Dom0, "normal memory" parts in the hypervisor. > > Iirc the split is being determined by firmware, and hence set in > > stone by the time OS (or hypervisor) boot starts. > > > > Jan > > >
Hi, On 01/20/2016 06:15 PM, Haozhong Zhang wrote: > CCing QEMU vNVDIMM maintainer: Xiao Guangrong > >> Conceptually, an NVDIMM is just like a fast SSD which is linearly mapped >> into memory. I am still on the dom0 side of this fence. >> >> The real question is whether it is possible to take an NVDIMM, split it >> in half, give each half to two different guests (with appropriate NFIT >> tables) and that be sufficient for the guests to just work. >> > > Yes, one NVDIMM device can be split into multiple parts and assigned > to different guests, and QEMU is responsible to maintain virtual NFIT > tables for each part. > >> Either way, it needs to be a toolstack policy decision as to how to >> split the resource. Currently, we are using NVDIMM as a block device and a DAX-based filesystem is created upon it in Linux so that file-related accesses directly reach the NVDIMM device. In KVM, if the NVDIMM device needs to be shared by different VMs, we can create multiple files on the DAX-based filesystem and assign a file to each VM. In the future, we can enable namespaces (partition-like) for PMEM and assign a namespace to each VM (the current Linux driver uses the whole PMEM as a single namespace). I think it is not easy to let the Xen hypervisor recognize NVDIMM devices and manage NVDIMM resources. Thanks!
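For context, the way a userspace process such as QEMU reaches one of those DAX-backed files today is plain POSIX; a minimal sketch, with the path and size chosen purely for illustration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Sketch only: map a file carved out of a DAX filesystem (or the
     * /dev/pmemXX device itself) into the process.  With DAX, loads and
     * stores through this mapping reach the NVDIMM directly. */
    int main(void)
    {
        const size_t size = 1UL << 30;      /* 1 GiB, just an example */
        int fd = open("/mnt/dax/vm1-nvdimm.img", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        void *pmem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (pmem == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... hand 'pmem' to the vNVDIMM backend / mapping hypercalls ... */
        munmap(pmem, size);
        close(fd);
        return 0;
    }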
On 01/20/16 01:46, Jan Beulich wrote: > >>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote: > > The primary reason of current solution is to reuse existing NVDIMM > > driver in Linux kernel. > CC'ing QEMU vNVDIMM maintainer: Xiao Guangrong > Re-using code in the Dom0 kernel has benefits and drawbacks, and > in any event needs to depend on proper layering to remain in place. > A benefit is less code duplication between Xen and Linux; along the > same lines a drawback is code duplication between various Dom0 > OS variants. > Not clear about other Dom0 OS. But for Linux, it already has a NVDIMM driver since 4.2. > > One responsibility of this driver is to discover NVDIMM devices and > > their parameters (e.g. which portion of an NVDIMM device can be mapped > > into the system address space and which address it is mapped to) by > > parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of > > ACPI Specification v6 and the actual code in Linux kernel > > (drivers/acpi/nfit.*), it's not a trivial task. > > To answer one of Kevin's questions: The NFIT table doesn't appear > to require the ACPI interpreter. They seem more like SRAT and SLIT. Sorry, I made a mistake in another reply. NFIT does not contain anything requiring ACPI interpreter. But there are some _DSM methods for NVDIMM in SSDT, which needs ACPI interpreter. > Also you failed to answer Kevin's question regarding E820 entries: I > think NVDIMM (or at least parts thereof) get represented in E820 (or > the EFI memory map), and if that's the case this would be a very > strong hint towards management needing to be in the hypervisor. > Legacy NVDIMM devices may use E820 entries or other ad-hoc ways to announce their locations, but newer ones that follow ACPI v6 spec do not need E820 any more and only need ACPI NFIT (i.e. firmware may not build E820 entries for them). The current linux kernel can handle both legacy and new NVDIMM devices and provide the same block device interface for them. > > Secondly, the driver implements a convenient block device interface to > > let software access areas where NVDIMM devices are mapped. The > > existing vNVDIMM implementation in QEMU uses this interface. > > > > As Linux NVDIMM driver has already done above, why do we bother to > > reimplement them in Xen? > > See above; a possibility is that we may need a split model (block > layer parts on Dom0, "normal memory" parts in the hypervisor. > Iirc the split is being determined by firmware, and hence set in > stone by the time OS (or hypervisor) boot starts. > For the "normal memory" parts, do you mean parts that map the host NVDIMM device's address space range to the guest? I'm going to implement that part in hypervisor and expose it as a hypercall so that it can be used by QEMU. Haozhong
>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote: > On 01/20/16 01:46, Jan Beulich wrote: >> >>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote: >> > Secondly, the driver implements a convenient block device interface to >> > let software access areas where NVDIMM devices are mapped. The >> > existing vNVDIMM implementation in QEMU uses this interface. >> > >> > As Linux NVDIMM driver has already done above, why do we bother to >> > reimplement them in Xen? >> >> See above; a possibility is that we may need a split model (block >> layer parts on Dom0, "normal memory" parts in the hypervisor. >> Iirc the split is being determined by firmware, and hence set in >> stone by the time OS (or hypervisor) boot starts. > > For the "normal memory" parts, do you mean parts that map the host > NVDIMM device's address space range to the guest? I'm going to > implement that part in hypervisor and expose it as a hypercall so that > it can be used by QEMU. To answer this I need to have my understanding of the partitioning being done by firmware confirmed: If that's the case, then "normal" means the part that doesn't get exposed as a block device (SSD). In any event there's no correlation to guest exposure here. Jan
On 20/01/16 10:36, Xiao Guangrong wrote: > > Hi, > > On 01/20/2016 06:15 PM, Haozhong Zhang wrote: > >> CCing QEMU vNVDIMM maintainer: Xiao Guangrong >> >>> Conceptually, an NVDIMM is just like a fast SSD which is linearly >>> mapped >>> into memory. I am still on the dom0 side of this fence. >>> >>> The real question is whether it is possible to take an NVDIMM, split it >>> in half, give each half to two different guests (with appropriate NFIT >>> tables) and that be sufficient for the guests to just work. >>> >> >> Yes, one NVDIMM device can be split into multiple parts and assigned >> to different guests, and QEMU is responsible to maintain virtual NFIT >> tables for each part. >> >>> Either way, it needs to be a toolstack policy decision as to how to >>> split the resource. > > Currently, we are using NVDIMM as a block device and a DAX-based > filesystem > is created upon it in Linux so that file-related accesses directly reach > the NVDIMM device. > > In KVM, If the NVDIMM device need to be shared by different VMs, we can > create multiple files on the DAX-based filesystem and assign the file to > each VMs. In the future, we can enable namespace (partition-like) for > PMEM > memory and assign the namespace to each VMs (current Linux driver uses > the > whole PMEM as a single namespace). > > I think it is not a easy work to let Xen hypervisor recognize NVDIMM > device > and manager NVDIMM resource. > > Thanks! > The more I see about this, the more sure I am that we want to keep it as a block device managed by dom0. In the case of the DAX-based filesystem, I presume files are not necessarily contiguous. I also presume that this is worked around by permuting the mapping of the virtual NVDIMM such that the it appears as a contiguous block of addresses to the guest? Today in Xen, Qemu already has the ability to create mappings in the guest's address space, e.g. to map PCI device BARs. I don't see a conceptual difference here, although the security/permission model certainly is more complicated. ~Andrew
On Wed, 20 Jan 2016, Andrew Cooper wrote: > On 20/01/16 10:36, Xiao Guangrong wrote: > > > > Hi, > > > > On 01/20/2016 06:15 PM, Haozhong Zhang wrote: > > > >> CCing QEMU vNVDIMM maintainer: Xiao Guangrong > >> > >>> Conceptually, an NVDIMM is just like a fast SSD which is linearly > >>> mapped > >>> into memory. I am still on the dom0 side of this fence. > >>> > >>> The real question is whether it is possible to take an NVDIMM, split it > >>> in half, give each half to two different guests (with appropriate NFIT > >>> tables) and that be sufficient for the guests to just work. > >>> > >> > >> Yes, one NVDIMM device can be split into multiple parts and assigned > >> to different guests, and QEMU is responsible to maintain virtual NFIT > >> tables for each part. > >> > >>> Either way, it needs to be a toolstack policy decision as to how to > >>> split the resource. > > > > Currently, we are using NVDIMM as a block device and a DAX-based > > filesystem > > is created upon it in Linux so that file-related accesses directly reach > > the NVDIMM device. > > > > In KVM, If the NVDIMM device need to be shared by different VMs, we can > > create multiple files on the DAX-based filesystem and assign the file to > > each VMs. In the future, we can enable namespace (partition-like) for > > PMEM > > memory and assign the namespace to each VMs (current Linux driver uses > > the > > whole PMEM as a single namespace). > > > > I think it is not a easy work to let Xen hypervisor recognize NVDIMM > > device > > and manager NVDIMM resource. > > > > Thanks! > > > > The more I see about this, the more sure I am that we want to keep it as > a block device managed by dom0. > > In the case of the DAX-based filesystem, I presume files are not > necessarily contiguous. I also presume that this is worked around by > permuting the mapping of the virtual NVDIMM such that the it appears as > a contiguous block of addresses to the guest? > > Today in Xen, Qemu already has the ability to create mappings in the > guest's address space, e.g. to map PCI device BARs. I don't see a > conceptual difference here, although the security/permission model > certainly is more complicated. I imagine that mmap'ing these /dev/pmemXX devices require root privileges, does it not? I wouldn't encourage the introduction of anything else that requires root privileges in QEMU. With QEMU running as non-root by default in 4.7, the feature will not be available unless users explicitly ask to run QEMU as root (which they shouldn't really).
On 01/20/16 13:16, Andrew Cooper wrote: > On 20/01/16 10:36, Xiao Guangrong wrote: > > > > Hi, > > > > On 01/20/2016 06:15 PM, Haozhong Zhang wrote: > > > >> CCing QEMU vNVDIMM maintainer: Xiao Guangrong > >> > >>> Conceptually, an NVDIMM is just like a fast SSD which is linearly > >>> mapped > >>> into memory. I am still on the dom0 side of this fence. > >>> > >>> The real question is whether it is possible to take an NVDIMM, split it > >>> in half, give each half to two different guests (with appropriate NFIT > >>> tables) and that be sufficient for the guests to just work. > >>> > >> > >> Yes, one NVDIMM device can be split into multiple parts and assigned > >> to different guests, and QEMU is responsible to maintain virtual NFIT > >> tables for each part. > >> > >>> Either way, it needs to be a toolstack policy decision as to how to > >>> split the resource. > > > > Currently, we are using NVDIMM as a block device and a DAX-based > > filesystem > > is created upon it in Linux so that file-related accesses directly reach > > the NVDIMM device. > > > > In KVM, If the NVDIMM device need to be shared by different VMs, we can > > create multiple files on the DAX-based filesystem and assign the file to > > each VMs. In the future, we can enable namespace (partition-like) for > > PMEM > > memory and assign the namespace to each VMs (current Linux driver uses > > the > > whole PMEM as a single namespace). > > > > I think it is not a easy work to let Xen hypervisor recognize NVDIMM > > device > > and manager NVDIMM resource. > > > > Thanks! > > > > The more I see about this, the more sure I am that we want to keep it as > a block device managed by dom0. > > In the case of the DAX-based filesystem, I presume files are not > necessarily contiguous. I also presume that this is worked around by > permuting the mapping of the virtual NVDIMM such that the it appears as > a contiguous block of addresses to the guest? > No, it's not necessary to be contiguous. We can map those none-contiguous parts into a contiguous guest physical address space area and QEMU fills the base and size of area in vNFIT. > Today in Xen, Qemu already has the ability to create mappings in the > guest's address space, e.g. to map PCI device BARs. I don't see a > conceptual difference here, although the security/permission model > certainly is more complicated. > I'm preparing a design document and let's see afterwards what would be a better solution. Thanks, Haozhong
On 01/20/16 14:29, Stefano Stabellini wrote: > On Wed, 20 Jan 2016, Andrew Cooper wrote: > > On 20/01/16 10:36, Xiao Guangrong wrote: > > > > > > Hi, > > > > > > On 01/20/2016 06:15 PM, Haozhong Zhang wrote: > > > > > >> CCing QEMU vNVDIMM maintainer: Xiao Guangrong > > >> > > >>> Conceptually, an NVDIMM is just like a fast SSD which is linearly > > >>> mapped > > >>> into memory. I am still on the dom0 side of this fence. > > >>> > > >>> The real question is whether it is possible to take an NVDIMM, split it > > >>> in half, give each half to two different guests (with appropriate NFIT > > >>> tables) and that be sufficient for the guests to just work. > > >>> > > >> > > >> Yes, one NVDIMM device can be split into multiple parts and assigned > > >> to different guests, and QEMU is responsible to maintain virtual NFIT > > >> tables for each part. > > >> > > >>> Either way, it needs to be a toolstack policy decision as to how to > > >>> split the resource. > > > > > > Currently, we are using NVDIMM as a block device and a DAX-based > > > filesystem > > > is created upon it in Linux so that file-related accesses directly reach > > > the NVDIMM device. > > > > > > In KVM, If the NVDIMM device need to be shared by different VMs, we can > > > create multiple files on the DAX-based filesystem and assign the file to > > > each VMs. In the future, we can enable namespace (partition-like) for > > > PMEM > > > memory and assign the namespace to each VMs (current Linux driver uses > > > the > > > whole PMEM as a single namespace). > > > > > > I think it is not a easy work to let Xen hypervisor recognize NVDIMM > > > device > > > and manager NVDIMM resource. > > > > > > Thanks! > > > > > > > The more I see about this, the more sure I am that we want to keep it as > > a block device managed by dom0. > > > > In the case of the DAX-based filesystem, I presume files are not > > necessarily contiguous. I also presume that this is worked around by > > permuting the mapping of the virtual NVDIMM such that the it appears as > > a contiguous block of addresses to the guest? > > > > Today in Xen, Qemu already has the ability to create mappings in the > > guest's address space, e.g. to map PCI device BARs. I don't see a > > conceptual difference here, although the security/permission model > > certainly is more complicated. > > I imagine that mmap'ing these /dev/pmemXX devices require root > privileges, does it not? > Yes, unless we assign non-root access permissions to /dev/pmemXX (but this is not the default behavior of linux kernel so far). > I wouldn't encourage the introduction of anything else that requires > root privileges in QEMU. With QEMU running as non-root by default in > 4.7, the feature will not be available unless users explicitly ask to > run QEMU as root (which they shouldn't really). > Yes, I'll include those privileged operations in the design document. Haozhong > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
On 20/01/16 14:29, Stefano Stabellini wrote: > On Wed, 20 Jan 2016, Andrew Cooper wrote: >> On 20/01/16 10:36, Xiao Guangrong wrote: >>> Hi, >>> >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote: >>> >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong >>>> >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly >>>>> mapped >>>>> into memory. I am still on the dom0 side of this fence. >>>>> >>>>> The real question is whether it is possible to take an NVDIMM, split it >>>>> in half, give each half to two different guests (with appropriate NFIT >>>>> tables) and that be sufficient for the guests to just work. >>>>> >>>> Yes, one NVDIMM device can be split into multiple parts and assigned >>>> to different guests, and QEMU is responsible to maintain virtual NFIT >>>> tables for each part. >>>> >>>>> Either way, it needs to be a toolstack policy decision as to how to >>>>> split the resource. >>> Currently, we are using NVDIMM as a block device and a DAX-based >>> filesystem >>> is created upon it in Linux so that file-related accesses directly reach >>> the NVDIMM device. >>> >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can >>> create multiple files on the DAX-based filesystem and assign the file to >>> each VMs. In the future, we can enable namespace (partition-like) for >>> PMEM >>> memory and assign the namespace to each VMs (current Linux driver uses >>> the >>> whole PMEM as a single namespace). >>> >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM >>> device >>> and manager NVDIMM resource. >>> >>> Thanks! >>> >> The more I see about this, the more sure I am that we want to keep it as >> a block device managed by dom0. >> >> In the case of the DAX-based filesystem, I presume files are not >> necessarily contiguous. I also presume that this is worked around by >> permuting the mapping of the virtual NVDIMM such that the it appears as >> a contiguous block of addresses to the guest? >> >> Today in Xen, Qemu already has the ability to create mappings in the >> guest's address space, e.g. to map PCI device BARs. I don't see a >> conceptual difference here, although the security/permission model >> certainly is more complicated. > I imagine that mmap'ing these /dev/pmemXX devices require root > privileges, does it not? I presume it does, although mmap()ing a file on a DAX filesystem will work in the standard POSIX way. Neither of these are sufficient however. That gets Qemu a mapping of the NVDIMM, not the guest. Something, one way or another, has to turn this into appropriate add-to-phymap hypercalls. > > I wouldn't encourage the introduction of anything else that requires > root privileges in QEMU. With QEMU running as non-root by default in > 4.7, the feature will not be available unless users explicitly ask to > run QEMU as root (which they shouldn't really). This isn't how design works. First, design a feature in an architecturally correct way, and then design an security policy to fit. (note, both before implement happens). We should not stunt design based on an existing implementation. In particular, if design shows that being a root only feature is the only sane way of doing this, it should be a root only feature. (I hope this is not the case, but it shouldn't cloud the judgement of a design). ~Andrew
On 01/20/16 14:45, Andrew Cooper wrote: > On 20/01/16 14:29, Stefano Stabellini wrote: > > On Wed, 20 Jan 2016, Andrew Cooper wrote: > >> On 20/01/16 10:36, Xiao Guangrong wrote: > >>> Hi, > >>> > >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote: > >>> > >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong > >>>> > >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly > >>>>> mapped > >>>>> into memory. I am still on the dom0 side of this fence. > >>>>> > >>>>> The real question is whether it is possible to take an NVDIMM, split it > >>>>> in half, give each half to two different guests (with appropriate NFIT > >>>>> tables) and that be sufficient for the guests to just work. > >>>>> > >>>> Yes, one NVDIMM device can be split into multiple parts and assigned > >>>> to different guests, and QEMU is responsible to maintain virtual NFIT > >>>> tables for each part. > >>>> > >>>>> Either way, it needs to be a toolstack policy decision as to how to > >>>>> split the resource. > >>> Currently, we are using NVDIMM as a block device and a DAX-based > >>> filesystem > >>> is created upon it in Linux so that file-related accesses directly reach > >>> the NVDIMM device. > >>> > >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can > >>> create multiple files on the DAX-based filesystem and assign the file to > >>> each VMs. In the future, we can enable namespace (partition-like) for > >>> PMEM > >>> memory and assign the namespace to each VMs (current Linux driver uses > >>> the > >>> whole PMEM as a single namespace). > >>> > >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM > >>> device > >>> and manager NVDIMM resource. > >>> > >>> Thanks! > >>> > >> The more I see about this, the more sure I am that we want to keep it as > >> a block device managed by dom0. > >> > >> In the case of the DAX-based filesystem, I presume files are not > >> necessarily contiguous. I also presume that this is worked around by > >> permuting the mapping of the virtual NVDIMM such that the it appears as > >> a contiguous block of addresses to the guest? > >> > >> Today in Xen, Qemu already has the ability to create mappings in the > >> guest's address space, e.g. to map PCI device BARs. I don't see a > >> conceptual difference here, although the security/permission model > >> certainly is more complicated. > > I imagine that mmap'ing these /dev/pmemXX devices require root > > privileges, does it not? > > I presume it does, although mmap()ing a file on a DAX filesystem will > work in the standard POSIX way. > > Neither of these are sufficient however. That gets Qemu a mapping of > the NVDIMM, not the guest. Something, one way or another, has to turn > this into appropriate add-to-phymap hypercalls. > Yes, those hypercalls are what I'm going to add. Haozhong > > > > I wouldn't encourage the introduction of anything else that requires > > root privileges in QEMU. With QEMU running as non-root by default in > > 4.7, the feature will not be available unless users explicitly ask to > > run QEMU as root (which they shouldn't really). > > This isn't how design works. > > First, design a feature in an architecturally correct way, and then > design an security policy to fit. (note, both before implement happens). > > We should not stunt design based on an existing implementation. In > particular, if design shows that being a root only feature is the only > sane way of doing this, it should be a root only feature. 
> (I hope this is not the case, but it shouldn't cloud the judgement of a design). > > ~Andrew > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
On Wed, 20 Jan 2016, Andrew Cooper wrote: > On 20/01/16 14:29, Stefano Stabellini wrote: > > On Wed, 20 Jan 2016, Andrew Cooper wrote: > >> On 20/01/16 10:36, Xiao Guangrong wrote: > >>> Hi, > >>> > >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote: > >>> > >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong > >>>> > >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly > >>>>> mapped > >>>>> into memory. I am still on the dom0 side of this fence. > >>>>> > >>>>> The real question is whether it is possible to take an NVDIMM, split it > >>>>> in half, give each half to two different guests (with appropriate NFIT > >>>>> tables) and that be sufficient for the guests to just work. > >>>>> > >>>> Yes, one NVDIMM device can be split into multiple parts and assigned > >>>> to different guests, and QEMU is responsible to maintain virtual NFIT > >>>> tables for each part. > >>>> > >>>>> Either way, it needs to be a toolstack policy decision as to how to > >>>>> split the resource. > >>> Currently, we are using NVDIMM as a block device and a DAX-based > >>> filesystem > >>> is created upon it in Linux so that file-related accesses directly reach > >>> the NVDIMM device. > >>> > >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can > >>> create multiple files on the DAX-based filesystem and assign the file to > >>> each VMs. In the future, we can enable namespace (partition-like) for > >>> PMEM > >>> memory and assign the namespace to each VMs (current Linux driver uses > >>> the > >>> whole PMEM as a single namespace). > >>> > >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM > >>> device > >>> and manager NVDIMM resource. > >>> > >>> Thanks! > >>> > >> The more I see about this, the more sure I am that we want to keep it as > >> a block device managed by dom0. > >> > >> In the case of the DAX-based filesystem, I presume files are not > >> necessarily contiguous. I also presume that this is worked around by > >> permuting the mapping of the virtual NVDIMM such that the it appears as > >> a contiguous block of addresses to the guest? > >> > >> Today in Xen, Qemu already has the ability to create mappings in the > >> guest's address space, e.g. to map PCI device BARs. I don't see a > >> conceptual difference here, although the security/permission model > >> certainly is more complicated. > > I imagine that mmap'ing these /dev/pmemXX devices require root > > privileges, does it not? > > I presume it does, although mmap()ing a file on a DAX filesystem will > work in the standard POSIX way. > > Neither of these are sufficient however. That gets Qemu a mapping of > the NVDIMM, not the guest. Something, one way or another, has to turn > this into appropriate add-to-phymap hypercalls. > > > > > I wouldn't encourage the introduction of anything else that requires > > root privileges in QEMU. With QEMU running as non-root by default in > > 4.7, the feature will not be available unless users explicitly ask to > > run QEMU as root (which they shouldn't really). > > This isn't how design works. > > First, design a feature in an architecturally correct way, and then > design an security policy to fit. > > We should not stunt design based on an existing implementation. In > particular, if design shows that being a root only feature is the only > sane way of doing this, it should be a root only feature. (I hope this > is not the case, but it shouldn't cloud the judgement of a design). 
I would argue that security is an integral part of the architecture and should not be retrofitted into it. Is it really a good design if the only sane way to implement it is making it a root-only feature? I think not. Designing security policies for pieces of software that don't have the infrastructure for them is costly and that cost should be accounted for as part of the overall cost of the solution rather than added to it in a second stage. > (note, both before implement happens). That is ideal but realistically in many cases nobody is able to produce a design before the implementation happens. There are plenty of articles written about this since the 90s / early 00s.
On Wed, Jan 20, 2016 at 07:04:49PM +0800, Haozhong Zhang wrote: > On 01/20/16 01:46, Jan Beulich wrote: > > >>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote: > > > The primary reason of current solution is to reuse existing NVDIMM > > > driver in Linux kernel. > > > > CC'ing QEMU vNVDIMM maintainer: Xiao Guangrong > > > Re-using code in the Dom0 kernel has benefits and drawbacks, and > > in any event needs to depend on proper layering to remain in place. > > A benefit is less code duplication between Xen and Linux; along the > > same lines a drawback is code duplication between various Dom0 > > OS variants. > > > > Not clear about other Dom0 OS. But for Linux, it already has a NVDIMM > driver since 4.2. > > > > One responsibility of this driver is to discover NVDIMM devices and > > > their parameters (e.g. which portion of an NVDIMM device can be mapped > > > into the system address space and which address it is mapped to) by > > > parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of > > > ACPI Specification v6 and the actual code in Linux kernel > > > (drivers/acpi/nfit.*), it's not a trivial task. > > > > To answer one of Kevin's questions: The NFIT table doesn't appear > > to require the ACPI interpreter. They seem more like SRAT and SLIT. > > Sorry, I made a mistake in another reply. NFIT does not contain > anything requiring ACPI interpreter. But there are some _DSM methods > for NVDIMM in SSDT, which needs ACPI interpreter. Right, but those are for health checks and such. Not needed for boot-time discovery of the ranges in memory of the NVDIMM. > > > Also you failed to answer Kevin's question regarding E820 entries: I > > think NVDIMM (or at least parts thereof) get represented in E820 (or > > the EFI memory map), and if that's the case this would be a very > > strong hint towards management needing to be in the hypervisor. > > > > Legacy NVDIMM devices may use E820 entries or other ad-hoc ways to > announce their locations, but newer ones that follow ACPI v6 spec do > not need E820 any more and only need ACPI NFIT (i.e. firmware may not > build E820 entries for them). I am missing something here. Linux pvops uses an hypercall to construct its E820 (XENMEM_machine_memory_map) see arch/x86/xen/setup.c:xen_memory_setup. That hypercall gets an filtered E820 from the hypervisor. And the hypervisor gets the E820 from multiboot2 - which gets it from grub2. With the 'legacy NVDIMM' using E820_NVDIMM (type 12? 13) - they don't show up in multiboot2 - which means Xen will ignore them (not sure if changes them to E820_RSRV or just leaves them alone). Anyhow for the /dev/pmem0 driver in Linux to construct an block device on the E820_NVDIMM - it MUST have the E820 entry - but we don't construct that. I would think that one of the patches would be for the hypervisor to recognize the E820_NVDIMM and associate that area with p2m_mmio (so that the xc_memory_mapping hypercall would work on the MFNs)? But you also mention ACPI v6 defining them an using ACPI NFIT - so that would be treating said system address extracted from the ACPI NFIT just as an MMIO (except it being WB instead of UC). Either way - Xen hypervisor should also parse the ACPI NFIT so that it can mark that range as p2m_mmio (or does it do that by default for any non-E820 ranges?). Does it actually need to do that? Or is that optional? I hope the design document will explain a bit of this. 
> > The current linux kernel can handle both legacy and new NVDIMM devices > and provide the same block device interface for them. OK, so Xen would need to do that as well - so that the Linux kernel can utilize it. > > > > Secondly, the driver implements a convenient block device interface to > > > let software access areas where NVDIMM devices are mapped. The > > > existing vNVDIMM implementation in QEMU uses this interface. > > > > > > As Linux NVDIMM driver has already done above, why do we bother to > > > reimplement them in Xen? > > > > See above; a possibility is that we may need a split model (block > > layer parts on Dom0, "normal memory" parts in the hypervisor. > > Iirc the split is being determined by firmware, and hence set in > > stone by the time OS (or hypervisor) boot starts. > > > > For the "normal memory" parts, do you mean parts that map the host > NVDIMM device's address space range to the guest? I'm going to > implement that part in hypervisor and expose it as a hypercall so that > it can be used by QEMU. > > Haozhong > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
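On the E820 question above, a quick way to check what dom0 actually sees is to dump the machine memory map that Xen returns for XENMEM_machine_memory_map and look for pmem-type entries (type 7 in ACPI 6, or the pre-standard type 12). The sketch below is illustrative only, not part of the patch series; it assumes libxc's xc_get_machine_memory_map() wrapper and the struct e820entry {addr, size, type} layout from Xen's public headers, which may differ slightly between versions.

    /* Hedged sketch: dump Xen's machine memory map and flag pmem-type
     * entries.  Assumes xc_get_machine_memory_map() returns the number of
     * entries written (negative on error). */
    #include <stdio.h>
    #include <xenctrl.h>

    #define E820_PMEM_ACPI6   7    /* AddressRangePersistentMemory, ACPI 6 */
    #define E820_PMEM_LEGACY 12    /* pre-standard NVDIMM type             */

    int main(void)
    {
        struct e820entry map[128];
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        int i, n;

        if (!xch)
            return 1;
        n = xc_get_machine_memory_map(xch, map, 128);
        for (i = 0; i < n; i++)
            printf("%016llx-%016llx type %u%s\n",
                   (unsigned long long)map[i].addr,
                   (unsigned long long)(map[i].addr + map[i].size),
                   map[i].type,
                   (map[i].type == E820_PMEM_ACPI6 ||
                    map[i].type == E820_PMEM_LEGACY) ? "  <- pmem?" : "");
        xc_interface_close(xch);
        return n < 0 ? 1 : 0;
    }

If the pmem ranges are present in the NFIT but missing from this map, that supports the point above that the /dev/pmem driver in a PV dom0 cannot rely on E820 alone.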
On Wed, Jan 20, 2016 at 10:53:10PM +0800, Haozhong Zhang wrote: > On 01/20/16 14:45, Andrew Cooper wrote: > > On 20/01/16 14:29, Stefano Stabellini wrote: > > > On Wed, 20 Jan 2016, Andrew Cooper wrote: > > >> On 20/01/16 10:36, Xiao Guangrong wrote: > > >>> Hi, > > >>> > > >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote: > > >>> > > >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong > > >>>> > > >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly > > >>>>> mapped > > >>>>> into memory. I am still on the dom0 side of this fence. > > >>>>> > > >>>>> The real question is whether it is possible to take an NVDIMM, split it > > >>>>> in half, give each half to two different guests (with appropriate NFIT > > >>>>> tables) and that be sufficient for the guests to just work. > > >>>>> > > >>>> Yes, one NVDIMM device can be split into multiple parts and assigned > > >>>> to different guests, and QEMU is responsible to maintain virtual NFIT > > >>>> tables for each part. > > >>>> > > >>>>> Either way, it needs to be a toolstack policy decision as to how to > > >>>>> split the resource. > > >>> Currently, we are using NVDIMM as a block device and a DAX-based > > >>> filesystem > > >>> is created upon it in Linux so that file-related accesses directly reach > > >>> the NVDIMM device. > > >>> > > >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can > > >>> create multiple files on the DAX-based filesystem and assign the file to > > >>> each VMs. In the future, we can enable namespace (partition-like) for > > >>> PMEM > > >>> memory and assign the namespace to each VMs (current Linux driver uses > > >>> the > > >>> whole PMEM as a single namespace). > > >>> > > >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM > > >>> device > > >>> and manager NVDIMM resource. > > >>> > > >>> Thanks! > > >>> > > >> The more I see about this, the more sure I am that we want to keep it as > > >> a block device managed by dom0. > > >> > > >> In the case of the DAX-based filesystem, I presume files are not > > >> necessarily contiguous. I also presume that this is worked around by > > >> permuting the mapping of the virtual NVDIMM such that the it appears as > > >> a contiguous block of addresses to the guest? > > >> > > >> Today in Xen, Qemu already has the ability to create mappings in the > > >> guest's address space, e.g. to map PCI device BARs. I don't see a > > >> conceptual difference here, although the security/permission model > > >> certainly is more complicated. > > > I imagine that mmap'ing these /dev/pmemXX devices require root > > > privileges, does it not? > > > > I presume it does, although mmap()ing a file on a DAX filesystem will > > work in the standard POSIX way. > > > > Neither of these are sufficient however. That gets Qemu a mapping of > > the NVDIMM, not the guest. Something, one way or another, has to turn > > this into appropriate add-to-phymap hypercalls. > > > > Yes, those hypercalls are what I'm going to add. Why? What you need (in a rought hand-wave way) is to: - mount /dev/pmem0 - mmap the file on /dev/pmem0 FS - walk the VMA for the file - extract the MFN (machien frame numbers) - feed those frame numbers to xc_memory_mapping hypercall. The guest pfns would be contingous. Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on /dev/pmem0 FS - the guest pfns are 0x200000 upward. However the MFNs may be discontingous as the NVDIMM could be an 1TB - and the 8GB file is scattered all over. 
I believe that is all you would need to do? > > Haozhong > > > > > > > I wouldn't encourage the introduction of anything else that requires > > > root privileges in QEMU. With QEMU running as non-root by default in > > > 4.7, the feature will not be available unless users explicitly ask to > > > run QEMU as root (which they shouldn't really). > > > > This isn't how design works. > > > > First, design a feature in an architecturally correct way, and then > > design an security policy to fit. (note, both before implement happens). > > > > We should not stunt design based on an existing implementation. In > > particular, if design shows that being a root only feature is the only > > sane way of doing this, it should be a root only feature. (I hope this > > is not the case, but it shouldn't cloud the judgement of a design). > > > > ~Andrew > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xen.org > > http://lists.xen.org/xen-devel > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
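To make the last two steps of the recipe above concrete: the libxc call being referred to is xc_domain_memory_mapping() (XEN_DOMCTL_memory_mapping), the same one QEMU's xen-pt uses to map PCI BARs. The sketch below is only an illustration of the idea, not code from the series; it assumes the host frame numbers have already been obtained somehow (see the /proc/<pid>/pagemap discussion later in the thread) and sidesteps the open questions of privilege and of whether Xen would accept NVDIMM frames for this domctl at all.

    /* Hedged sketch: stitch scattered host frame numbers of a DAX-backed
     * file into a contiguous guest-pfn range, batching contiguous runs. */
    #include <xenctrl.h>

    static int map_pmem_file_to_guest(xc_interface *xch, uint32_t domid,
                                      unsigned long gpfn_base,
                                      const unsigned long *mfns,
                                      unsigned long nr_pages)
    {
        unsigned long i = 0;

        while (i < nr_pages) {
            unsigned long run = 1;   /* length of a physically contiguous run */

            while (i + run < nr_pages && mfns[i + run] == mfns[i] + run)
                run++;

            if (xc_domain_memory_mapping(xch, domid,
                                         gpfn_base + i,   /* first guest pfn  */
                                         mfns[i],         /* first host frame */
                                         run,
                                         1 /* DPCI_ADD_MAPPING */) < 0)
                return -1;
            i += run;
        }
        return 0;
    }

With the 8GB example above, gpfn_base would be 0x200000 and nr_pages 0x200000 (8GB of 4k pages), while mfns[] could be scattered anywhere across the 1TB device.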
On 01/20/16 10:13, Konrad Rzeszutek Wilk wrote: > On Wed, Jan 20, 2016 at 10:53:10PM +0800, Haozhong Zhang wrote: > > On 01/20/16 14:45, Andrew Cooper wrote: > > > On 20/01/16 14:29, Stefano Stabellini wrote: > > > > On Wed, 20 Jan 2016, Andrew Cooper wrote: > > > >> On 20/01/16 10:36, Xiao Guangrong wrote: > > > >>> Hi, > > > >>> > > > >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote: > > > >>> > > > >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong > > > >>>> > > > >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly > > > >>>>> mapped > > > >>>>> into memory. I am still on the dom0 side of this fence. > > > >>>>> > > > >>>>> The real question is whether it is possible to take an NVDIMM, split it > > > >>>>> in half, give each half to two different guests (with appropriate NFIT > > > >>>>> tables) and that be sufficient for the guests to just work. > > > >>>>> > > > >>>> Yes, one NVDIMM device can be split into multiple parts and assigned > > > >>>> to different guests, and QEMU is responsible to maintain virtual NFIT > > > >>>> tables for each part. > > > >>>> > > > >>>>> Either way, it needs to be a toolstack policy decision as to how to > > > >>>>> split the resource. > > > >>> Currently, we are using NVDIMM as a block device and a DAX-based > > > >>> filesystem > > > >>> is created upon it in Linux so that file-related accesses directly reach > > > >>> the NVDIMM device. > > > >>> > > > >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can > > > >>> create multiple files on the DAX-based filesystem and assign the file to > > > >>> each VMs. In the future, we can enable namespace (partition-like) for > > > >>> PMEM > > > >>> memory and assign the namespace to each VMs (current Linux driver uses > > > >>> the > > > >>> whole PMEM as a single namespace). > > > >>> > > > >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM > > > >>> device > > > >>> and manager NVDIMM resource. > > > >>> > > > >>> Thanks! > > > >>> > > > >> The more I see about this, the more sure I am that we want to keep it as > > > >> a block device managed by dom0. > > > >> > > > >> In the case of the DAX-based filesystem, I presume files are not > > > >> necessarily contiguous. I also presume that this is worked around by > > > >> permuting the mapping of the virtual NVDIMM such that the it appears as > > > >> a contiguous block of addresses to the guest? > > > >> > > > >> Today in Xen, Qemu already has the ability to create mappings in the > > > >> guest's address space, e.g. to map PCI device BARs. I don't see a > > > >> conceptual difference here, although the security/permission model > > > >> certainly is more complicated. > > > > I imagine that mmap'ing these /dev/pmemXX devices require root > > > > privileges, does it not? > > > > > > I presume it does, although mmap()ing a file on a DAX filesystem will > > > work in the standard POSIX way. > > > > > > Neither of these are sufficient however. That gets Qemu a mapping of > > > the NVDIMM, not the guest. Something, one way or another, has to turn > > > this into appropriate add-to-phymap hypercalls. > > > > > > > Yes, those hypercalls are what I'm going to add. > > Why? > > What you need (in a rought hand-wave way) is to: > - mount /dev/pmem0 > - mmap the file on /dev/pmem0 FS > - walk the VMA for the file - extract the MFN (machien frame numbers) Can this step be done by QEMU? Or does linux kernel provide some approach for the userspace to do the translation? 
Haozhong > - feed those frame numbers to xc_memory_mapping hypercall. The > guest pfns would be contingous. > Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on > /dev/pmem0 FS - the guest pfns are 0x200000 upward. > > However the MFNs may be discontingous as the NVDIMM could be an > 1TB - and the 8GB file is scattered all over. > > I believe that is all you would need to do? > > > > Haozhong > > > > > > > > > > I wouldn't encourage the introduction of anything else that requires > > > > root privileges in QEMU. With QEMU running as non-root by default in > > > > 4.7, the feature will not be available unless users explicitly ask to > > > > run QEMU as root (which they shouldn't really). > > > > > > This isn't how design works. > > > > > > First, design a feature in an architecturally correct way, and then > > > design an security policy to fit. (note, both before implement happens). > > > > > > We should not stunt design based on an existing implementation. In > > > particular, if design shows that being a root only feature is the only > > > sane way of doing this, it should be a root only feature. (I hope this > > > is not the case, but it shouldn't cloud the judgement of a design). > > > > > > ~Andrew > > > > > > _______________________________________________ > > > Xen-devel mailing list > > > Xen-devel@lists.xen.org > > > http://lists.xen.org/xen-devel > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xen.org > > http://lists.xen.org/xen-devel > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
On 01/20/2016 07:20 PM, Jan Beulich wrote: >>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote: >> On 01/20/16 01:46, Jan Beulich wrote: >>>>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote: >>>> Secondly, the driver implements a convenient block device interface to >>>> let software access areas where NVDIMM devices are mapped. The >>>> existing vNVDIMM implementation in QEMU uses this interface. >>>> >>>> As Linux NVDIMM driver has already done above, why do we bother to >>>> reimplement them in Xen? >>> >>> See above; a possibility is that we may need a split model (block >>> layer parts on Dom0, "normal memory" parts in the hypervisor. >>> Iirc the split is being determined by firmware, and hence set in >>> stone by the time OS (or hypervisor) boot starts. >> >> For the "normal memory" parts, do you mean parts that map the host >> NVDIMM device's address space range to the guest? I'm going to >> implement that part in hypervisor and expose it as a hypercall so that >> it can be used by QEMU. > > To answer this I need to have my understanding of the partitioning > being done by firmware confirmed: If that's the case, then "normal" > means the part that doesn't get exposed as a block device (SSD). > In any event there's no correlation to guest exposure here. Firmware does not manage NVDIMM. All the operations of nvdimm are handled by OS. Actually, there are lots of things we should take into account if we move the NVDIMM management to hypervisor: a) ACPI NFIT interpretation A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the base information of NVDIMM devices which includes PMEM info, PBLK info, nvdimm device interleave, vendor info, etc. Let me explain it one by one. PMEM and PBLK are two modes to access NVDIMM devices: 1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address space so that CPU can r/w it directly. 2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM only offers two windows which are mapped to CPU's address space, the data window and access window, so that CPU can use these two windows to access the whole NVDIMM device. NVDIMM device is interleaved whose info is also exported so that we can calculate the address to access the specified NVDIMM device. NVDIMM devices from different vendor can have different function so that the vendor info is exported by NFIT to make vendor's driver work. b) ACPI SSDT interpretation SSDT offers _DSM method which controls NVDIMM device, such as label operation, health check etc and hotplug support. c) Resource management NVDIMM resource management challenged as: 1) PMEM is huge and it is little slower access than RAM so it is not suitable to manage it as page struct (i think it is not a big problem in Xen hypervisor?) 2) need to partition it to it be used in multiple VMs. 3) need to support PBLK and partition it in the future. d) management tools support S.M.A.R.T? error detection and recovering? c) hotplug support d) third parts drivers Vendor drivers need to be ported to xen hypervisor and let it be supported in the management tool. e) ...
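To give a feel for how much of point a) is a plain table walk: the static NFIT is a sequence of length-prefixed sub-structures after the 36-byte ACPI header plus a 4-byte reserved field, and the PMEM ranges live in the type-0 SPA Range Structures. The sketch below reflects my own reading of ACPI 6.0, Section 5.2.25 (offsets should be double-checked against the spec) and is not code from either the Xen or the QEMU series; a caller could feed it the raw bytes from /sys/firmware/acpi/tables/NFIT on a Linux dom0. The genuinely hard parts - interleave decoding, vendor _DSMs, labels - are not covered by anything this simple.

    /* Hedged sketch: list SPA range structures from a raw NFIT image.
     * Offsets per my reading of ACPI 6.0, Section 5.2.25. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NFIT_HDR_LEN  36   /* standard ACPI table header   */
    #define NFIT_RSVD_LEN  4   /* reserved field after header  */
    #define NFIT_TYPE_SPA  0   /* SPA Range Structure          */

    static void list_spa_ranges(const uint8_t *nfit, size_t len)
    {
        size_t off = NFIT_HDR_LEN + NFIT_RSVD_LEN;

        while (off + 4 <= len) {
            uint16_t type, slen;
            memcpy(&type, nfit + off, 2);
            memcpy(&slen, nfit + off + 2, 2);
            if (slen < 4 || off + slen > len)
                break;                          /* malformed table */

            if (type == NFIT_TYPE_SPA && slen >= 56) {
                uint32_t proximity;
                uint64_t base, length;
                memcpy(&proximity, nfit + off + 12, 4);
                memcpy(&base,      nfit + off + 32, 8);
                memcpy(&length,    nfit + off + 40, 8);
                printf("SPA range: base 0x%llx len 0x%llx proximity %u\n",
                       (unsigned long long)base,
                       (unsigned long long)length, proximity);
            }
            off += slen;
        }
    }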
> > > > Neither of these are sufficient however. That gets Qemu a mapping of > > > > the NVDIMM, not the guest. Something, one way or another, has to turn > > > > this into appropriate add-to-phymap hypercalls. > > > > > > > > > > Yes, those hypercalls are what I'm going to add. > > > > Why? > > > > What you need (in a rought hand-wave way) is to: > > - mount /dev/pmem0 > > - mmap the file on /dev/pmem0 FS > > - walk the VMA for the file - extract the MFN (machien frame numbers) > > Can this step be done by QEMU? Or does linux kernel provide some > approach for the userspace to do the translation? I don't know. I would think no - as you wouldn't want the userspace application to figure out the physical frames from the virtual address (unless they are root). But then if you look in /proc/<pid>/maps and /proc/<pid>/smaps there are some data there. Hm, /proc/<pid>/pagemaps has something intersting See pagemap_read function. That looks to be doing it? > > Haozhong > > > - feed those frame numbers to xc_memory_mapping hypercall. The > > guest pfns would be contingous. > > Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on > > /dev/pmem0 FS - the guest pfns are 0x200000 upward. > > > > However the MFNs may be discontingous as the NVDIMM could be an > > 1TB - and the 8GB file is scattered all over. > > > > I believe that is all you would need to do?
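The interface referred to above is /proc/<pid>/pagemap (documented in Documentation/vm/pagemap.txt): one little-endian 64-bit word per virtual page, with the frame number in bits 0-54 and a present flag in bit 63. Note that on recent kernels the frame bits read back as zero without CAP_SYS_ADMIN, so this does not avoid the privilege question, and in a PV dom0 the value is a dom0 pseudo-physical frame that would still need a p2m lookup to become a machine frame. A minimal sketch of the lookup for the calling process:

    /* Hedged sketch: translate a virtual address of the calling process
     * into a page frame number via /proc/self/pagemap.  Needs
     * CAP_SYS_ADMIN on recent kernels; returns (uint64_t)-1 on failure. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    static uint64_t vaddr_to_pfn(const void *vaddr)
    {
        long psz = sysconf(_SC_PAGESIZE);
        off_t off = ((uintptr_t)vaddr / psz) * sizeof(uint64_t);
        uint64_t entry, pfn = (uint64_t)-1;
        int fd = open("/proc/self/pagemap", O_RDONLY);

        if (fd < 0)
            return pfn;
        if (pread(fd, &entry, sizeof(entry), off) == sizeof(entry) &&
            (entry & (1ULL << 63)))              /* page present          */
            pfn = entry & ((1ULL << 55) - 1);    /* bits 0-54 = frame no. */
        close(fd);
        return pfn;
    }

QEMU would mmap() the DAX file and fault the pages in first, since pagemap only reports frames for pages that are actually present.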
On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote: > > > On 01/20/2016 07:20 PM, Jan Beulich wrote: > >>>>On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote: > >>On 01/20/16 01:46, Jan Beulich wrote: > >>>>>>On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote: > >>>>Secondly, the driver implements a convenient block device interface to > >>>>let software access areas where NVDIMM devices are mapped. The > >>>>existing vNVDIMM implementation in QEMU uses this interface. > >>>> > >>>>As Linux NVDIMM driver has already done above, why do we bother to > >>>>reimplement them in Xen? > >>> > >>>See above; a possibility is that we may need a split model (block > >>>layer parts on Dom0, "normal memory" parts in the hypervisor. > >>>Iirc the split is being determined by firmware, and hence set in > >>>stone by the time OS (or hypervisor) boot starts. > >> > >>For the "normal memory" parts, do you mean parts that map the host > >>NVDIMM device's address space range to the guest? I'm going to > >>implement that part in hypervisor and expose it as a hypercall so that > >>it can be used by QEMU. > > > >To answer this I need to have my understanding of the partitioning > >being done by firmware confirmed: If that's the case, then "normal" > >means the part that doesn't get exposed as a block device (SSD). > >In any event there's no correlation to guest exposure here. > > Firmware does not manage NVDIMM. All the operations of nvdimm are handled > by OS. > > Actually, there are lots of things we should take into account if we move > the NVDIMM management to hypervisor: If you remove the block device part and just deal with pmem part then this gets smaller. Also the _DSM operations - I can't see them being in hypervisor - but only in the dom0 - which would have the right software to tickle the correct ioctl on /dev/pmem to do the "management" (carve the NVDIMM, perform an SMART operation, etc). > a) ACPI NFIT interpretation > A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the > base information of NVDIMM devices which includes PMEM info, PBLK > info, nvdimm device interleave, vendor info, etc. Let me explain it one > by one. And it is a static table. As in part of the MADT. > > PMEM and PBLK are two modes to access NVDIMM devices: > 1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address > space so that CPU can r/w it directly. > 2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM > only offers two windows which are mapped to CPU's address space, the data > window and access window, so that CPU can use these two windows to access > the whole NVDIMM device. > > NVDIMM device is interleaved whose info is also exported so that we can > calculate the address to access the specified NVDIMM device. Right, along with the serial numbers. > > NVDIMM devices from different vendor can have different function so that the > vendor info is exported by NFIT to make vendor's driver work. via _DSM right? > > b) ACPI SSDT interpretation > SSDT offers _DSM method which controls NVDIMM device, such as label operation, > health check etc and hotplug support. Sounds like the control domain (dom0) would be in charge of that. > > c) Resource management > NVDIMM resource management challenged as: > 1) PMEM is huge and it is little slower access than RAM so it is not suitable > to manage it as page struct (i think it is not a big problem in Xen > hypervisor?) > 2) need to partition it to it be used in multiple VMs. 
> 3) need to support PBLK and partition it in the future. That all sounds to me like an control domain (dom0) decisions. Not Xen hypervisor. > > d) management tools support > S.M.A.R.T? error detection and recovering? > > c) hotplug support How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS to scan. That would require the hypervisor also reading this for it to update it's data-structures. > > d) third parts drivers > Vendor drivers need to be ported to xen hypervisor and let it be supported in > the management tool. Ewww. I presume the 'third party drivers' mean more interesting _DSM features right? On the base level the firmware with this type of NVDIMM would still have the basic - ACPI NFIT + E820_NVDIMM (optional). > > e) ... > > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
On 01/20/16 10:41, Konrad Rzeszutek Wilk wrote: > > > > > Neither of these are sufficient however. That gets Qemu a mapping of > > > > > the NVDIMM, not the guest. Something, one way or another, has to turn > > > > > this into appropriate add-to-phymap hypercalls. > > > > > > > > > > > > > Yes, those hypercalls are what I'm going to add. > > > > > > Why? > > > > > > What you need (in a rought hand-wave way) is to: > > > - mount /dev/pmem0 > > > - mmap the file on /dev/pmem0 FS > > > - walk the VMA for the file - extract the MFN (machien frame numbers) > > > > Can this step be done by QEMU? Or does linux kernel provide some > > approach for the userspace to do the translation? > > I don't know. I would think no - as you wouldn't want the userspace > application to figure out the physical frames from the virtual > address (unless they are root). But then if you look in > /proc/<pid>/maps and /proc/<pid>/smaps there are some data there. > > Hm, /proc/<pid>/pagemaps has something intersting > > See pagemap_read function. That looks to be doing it? > Interesting and good to know this. I'll have a look at it. Thanks, Haozhong > > > > Haozhong > > > > > - feed those frame numbers to xc_memory_mapping hypercall. The > > > guest pfns would be contingous. > > > Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on > > > /dev/pmem0 FS - the guest pfns are 0x200000 upward. > > > > > > However the MFNs may be discontingous as the NVDIMM could be an > > > 1TB - and the 8GB file is scattered all over. > > > > > > I believe that is all you would need to do? > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
On 01/20/2016 11:47 PM, Konrad Rzeszutek Wilk wrote: > On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote: >> >> >> On 01/20/2016 07:20 PM, Jan Beulich wrote: >>>>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote: >>>> On 01/20/16 01:46, Jan Beulich wrote: >>>>>>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote: >>>>>> Secondly, the driver implements a convenient block device interface to >>>>>> let software access areas where NVDIMM devices are mapped. The >>>>>> existing vNVDIMM implementation in QEMU uses this interface. >>>>>> >>>>>> As Linux NVDIMM driver has already done above, why do we bother to >>>>>> reimplement them in Xen? >>>>> >>>>> See above; a possibility is that we may need a split model (block >>>>> layer parts on Dom0, "normal memory" parts in the hypervisor. >>>>> Iirc the split is being determined by firmware, and hence set in >>>>> stone by the time OS (or hypervisor) boot starts. >>>> >>>> For the "normal memory" parts, do you mean parts that map the host >>>> NVDIMM device's address space range to the guest? I'm going to >>>> implement that part in hypervisor and expose it as a hypercall so that >>>> it can be used by QEMU. >>> >>> To answer this I need to have my understanding of the partitioning >>> being done by firmware confirmed: If that's the case, then "normal" >>> means the part that doesn't get exposed as a block device (SSD). >>> In any event there's no correlation to guest exposure here. >> >> Firmware does not manage NVDIMM. All the operations of nvdimm are handled >> by OS. >> >> Actually, there are lots of things we should take into account if we move >> the NVDIMM management to hypervisor: > > If you remove the block device part and just deal with pmem part then this > gets smaller. > Yes indeed. But xen can not benefit from NVDIMM BLK, i think it is not a long time plan. :) > Also the _DSM operations - I can't see them being in hypervisor - but only > in the dom0 - which would have the right software to tickle the correct > ioctl on /dev/pmem to do the "management" (carve the NVDIMM, perform > an SMART operation, etc). Yes, it is reasonable to put it in dom 0 and it makes management tools happy. > >> a) ACPI NFIT interpretation >> A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the >> base information of NVDIMM devices which includes PMEM info, PBLK >> info, nvdimm device interleave, vendor info, etc. Let me explain it one >> by one. > > And it is a static table. As in part of the MADT. Yes, it is, but we need to fetch updated nvdimm info from _FIT in SSDT/DSDT instead if a nvdimm device is hotpluged, please see below. >> >> PMEM and PBLK are two modes to access NVDIMM devices: >> 1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address >> space so that CPU can r/w it directly. >> 2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM >> only offers two windows which are mapped to CPU's address space, the data >> window and access window, so that CPU can use these two windows to access >> the whole NVDIMM device. >> >> NVDIMM device is interleaved whose info is also exported so that we can >> calculate the address to access the specified NVDIMM device. > > Right, along with the serial numbers. >> >> NVDIMM devices from different vendor can have different function so that the >> vendor info is exported by NFIT to make vendor's driver work. > > via _DSM right? Yes. 
>> >> b) ACPI SSDT interpretation >> SSDT offers _DSM method which controls NVDIMM device, such as label operation, >> health check etc and hotplug support. > > Sounds like the control domain (dom0) would be in charge of that. Yup. Dom0 is a better place to handle it. >> >> c) Resource management >> NVDIMM resource management challenged as: >> 1) PMEM is huge and it is little slower access than RAM so it is not suitable >> to manage it as page struct (i think it is not a big problem in Xen >> hypervisor?) >> 2) need to partition it to it be used in multiple VMs. >> 3) need to support PBLK and partition it in the future. > > That all sounds to me like an control domain (dom0) decisions. Not Xen hypervisor. Sure, so let dom0 handle this is better, we are on the same page. :) >> >> d) management tools support >> S.M.A.R.T? error detection and recovering? >> >> c) hotplug support > > How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS > to scan. That would require the hypervisor also reading this for it to > update it's data-structures. Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface, _FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is the better place handing this case too. >> >> d) third parts drivers >> Vendor drivers need to be ported to xen hypervisor and let it be supported in >> the management tool. > > Ewww. > > I presume the 'third party drivers' mean more interesting _DSM features right? Yes. > On the base level the firmware with this type of NVDIMM would still have > the basic - ACPI NFIT + E820_NVDIMM (optional). >> Yes.
On Thu, Jan 21, 2016 at 12:25:08AM +0800, Xiao Guangrong wrote: > > > On 01/20/2016 11:47 PM, Konrad Rzeszutek Wilk wrote: > >On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote: > >> > >> > >>On 01/20/2016 07:20 PM, Jan Beulich wrote: > >>>>>>On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote: > >>>>On 01/20/16 01:46, Jan Beulich wrote: > >>>>>>>>On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote: > >>>>>>Secondly, the driver implements a convenient block device interface to > >>>>>>let software access areas where NVDIMM devices are mapped. The > >>>>>>existing vNVDIMM implementation in QEMU uses this interface. > >>>>>> > >>>>>>As Linux NVDIMM driver has already done above, why do we bother to > >>>>>>reimplement them in Xen? > >>>>> > >>>>>See above; a possibility is that we may need a split model (block > >>>>>layer parts on Dom0, "normal memory" parts in the hypervisor. > >>>>>Iirc the split is being determined by firmware, and hence set in > >>>>>stone by the time OS (or hypervisor) boot starts. > >>>> > >>>>For the "normal memory" parts, do you mean parts that map the host > >>>>NVDIMM device's address space range to the guest? I'm going to > >>>>implement that part in hypervisor and expose it as a hypercall so that > >>>>it can be used by QEMU. > >>> > >>>To answer this I need to have my understanding of the partitioning > >>>being done by firmware confirmed: If that's the case, then "normal" > >>>means the part that doesn't get exposed as a block device (SSD). > >>>In any event there's no correlation to guest exposure here. > >> > >>Firmware does not manage NVDIMM. All the operations of nvdimm are handled > >>by OS. > >> > >>Actually, there are lots of things we should take into account if we move > >>the NVDIMM management to hypervisor: > > > >If you remove the block device part and just deal with pmem part then this > >gets smaller. > > > > Yes indeed. But xen can not benefit from NVDIMM BLK, i think it is not a long > time plan. :) > > >Also the _DSM operations - I can't see them being in hypervisor - but only > >in the dom0 - which would have the right software to tickle the correct > >ioctl on /dev/pmem to do the "management" (carve the NVDIMM, perform > >an SMART operation, etc). > > Yes, it is reasonable to put it in dom 0 and it makes management tools happy. > > > > >>a) ACPI NFIT interpretation > >> A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the > >> base information of NVDIMM devices which includes PMEM info, PBLK > >> info, nvdimm device interleave, vendor info, etc. Let me explain it one > >> by one. > > > >And it is a static table. As in part of the MADT. > > Yes, it is, but we need to fetch updated nvdimm info from _FIT in SSDT/DSDT instead > if a nvdimm device is hotpluged, please see below. > > >> > >> PMEM and PBLK are two modes to access NVDIMM devices: > >> 1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address > >> space so that CPU can r/w it directly. > >> 2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM > >> only offers two windows which are mapped to CPU's address space, the data > >> window and access window, so that CPU can use these two windows to access > >> the whole NVDIMM device. > >> > >> NVDIMM device is interleaved whose info is also exported so that we can > >> calculate the address to access the specified NVDIMM device. > > > >Right, along with the serial numbers. 
> >> > >> NVDIMM devices from different vendor can have different function so that the > >> vendor info is exported by NFIT to make vendor's driver work. > > > >via _DSM right? > > Yes. > > >> > >>b) ACPI SSDT interpretation > >> SSDT offers _DSM method which controls NVDIMM device, such as label operation, > >> health check etc and hotplug support. > > > >Sounds like the control domain (dom0) would be in charge of that. > > Yup. Dom0 is a better place to handle it. > > >> > >>c) Resource management > >> NVDIMM resource management challenged as: > >> 1) PMEM is huge and it is little slower access than RAM so it is not suitable > >> to manage it as page struct (i think it is not a big problem in Xen > >> hypervisor?) > >> 2) need to partition it to it be used in multiple VMs. > >> 3) need to support PBLK and partition it in the future. > > > >That all sounds to me like an control domain (dom0) decisions. Not Xen hypervisor. > > Sure, so let dom0 handle this is better, we are on the same page. :) > > >> > >>d) management tools support > >> S.M.A.R.T? error detection and recovering? > >> > >>c) hotplug support > > > >How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS > >to scan. That would require the hypervisor also reading this for it to > >update it's data-structures. > > Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface, > _FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is > the better place handing this case too. That one is a bit difficult. Both the OS and the hypervisor would need to know about this (I think?). dom0 since it gets the ACPI event and needs to process it. Then the hypervisor needs to be told so it can slurp it up. However I don't know if the hypervisor needs to know all the details of an NVDIMM - or just the starting and ending ranges so that when an guest is created and the VT-d is constructed - it can be assured that the ranges are valid. I am not an expert on the P2M code - but I think that would need to be looked at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN. > > >> > >>d) third parts drivers > >> Vendor drivers need to be ported to xen hypervisor and let it be supported in > >> the management tool. > > > >Ewww. > > > >I presume the 'third party drivers' mean more interesting _DSM features right? > > Yes. > > >On the base level the firmware with this type of NVDIMM would still have > >the basic - ACPI NFIT + E820_NVDIMM (optional). > >> > > Yes.
On 01/21/2016 12:47 AM, Konrad Rzeszutek Wilk wrote: > On Thu, Jan 21, 2016 at 12:25:08AM +0800, Xiao Guangrong wrote: >> >> >> On 01/20/2016 11:47 PM, Konrad Rzeszutek Wilk wrote: >>> On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote: >>>> >>>> >>>> On 01/20/2016 07:20 PM, Jan Beulich wrote: >>>>>>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote: >>>>>> On 01/20/16 01:46, Jan Beulich wrote: >>>>>>>>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote: >>>>>>>> Secondly, the driver implements a convenient block device interface to >>>>>>>> let software access areas where NVDIMM devices are mapped. The >>>>>>>> existing vNVDIMM implementation in QEMU uses this interface. >>>>>>>> >>>>>>>> As Linux NVDIMM driver has already done above, why do we bother to >>>>>>>> reimplement them in Xen? >>>>>>> >>>>>>> See above; a possibility is that we may need a split model (block >>>>>>> layer parts on Dom0, "normal memory" parts in the hypervisor. >>>>>>> Iirc the split is being determined by firmware, and hence set in >>>>>>> stone by the time OS (or hypervisor) boot starts. >>>>>> >>>>>> For the "normal memory" parts, do you mean parts that map the host >>>>>> NVDIMM device's address space range to the guest? I'm going to >>>>>> implement that part in hypervisor and expose it as a hypercall so that >>>>>> it can be used by QEMU. >>>>> >>>>> To answer this I need to have my understanding of the partitioning >>>>> being done by firmware confirmed: If that's the case, then "normal" >>>>> means the part that doesn't get exposed as a block device (SSD). >>>>> In any event there's no correlation to guest exposure here. >>>> >>>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled >>>> by OS. >>>> >>>> Actually, there are lots of things we should take into account if we move >>>> the NVDIMM management to hypervisor: >>> >>> If you remove the block device part and just deal with pmem part then this >>> gets smaller. >>> >> >> Yes indeed. But xen can not benefit from NVDIMM BLK, i think it is not a long >> time plan. :) >> >>> Also the _DSM operations - I can't see them being in hypervisor - but only >>> in the dom0 - which would have the right software to tickle the correct >>> ioctl on /dev/pmem to do the "management" (carve the NVDIMM, perform >>> an SMART operation, etc). >> >> Yes, it is reasonable to put it in dom 0 and it makes management tools happy. >> >>> >>>> a) ACPI NFIT interpretation >>>> A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the >>>> base information of NVDIMM devices which includes PMEM info, PBLK >>>> info, nvdimm device interleave, vendor info, etc. Let me explain it one >>>> by one. >>> >>> And it is a static table. As in part of the MADT. >> >> Yes, it is, but we need to fetch updated nvdimm info from _FIT in SSDT/DSDT instead >> if a nvdimm device is hotpluged, please see below. >> >>>> >>>> PMEM and PBLK are two modes to access NVDIMM devices: >>>> 1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address >>>> space so that CPU can r/w it directly. >>>> 2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM >>>> only offers two windows which are mapped to CPU's address space, the data >>>> window and access window, so that CPU can use these two windows to access >>>> the whole NVDIMM device. >>>> >>>> NVDIMM device is interleaved whose info is also exported so that we can >>>> calculate the address to access the specified NVDIMM device. 
>>> >>> Right, along with the serial numbers. >>>> >>>> NVDIMM devices from different vendor can have different function so that the >>>> vendor info is exported by NFIT to make vendor's driver work. >>> >>> via _DSM right? >> >> Yes. >> >>>> >>>> b) ACPI SSDT interpretation >>>> SSDT offers _DSM method which controls NVDIMM device, such as label operation, >>>> health check etc and hotplug support. >>> >>> Sounds like the control domain (dom0) would be in charge of that. >> >> Yup. Dom0 is a better place to handle it. >> >>>> >>>> c) Resource management >>>> NVDIMM resource management challenged as: >>>> 1) PMEM is huge and it is little slower access than RAM so it is not suitable >>>> to manage it as page struct (i think it is not a big problem in Xen >>>> hypervisor?) >>>> 2) need to partition it to it be used in multiple VMs. >>>> 3) need to support PBLK and partition it in the future. >>> >>> That all sounds to me like an control domain (dom0) decisions. Not Xen hypervisor. >> >> Sure, so let dom0 handle this is better, we are on the same page. :) >> >>>> >>>> d) management tools support >>>> S.M.A.R.T? error detection and recovering? >>>> >>>> c) hotplug support >>> >>> How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS >>> to scan. That would require the hypervisor also reading this for it to >>> update it's data-structures. >> >> Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface, >> _FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is >> the better place handing this case too. > > That one is a bit difficult. Both the OS and the hypervisor would need to know about > this (I think?). dom0 since it gets the ACPI event and needs to process it. Then > the hypervisor needs to be told so it can slurp it up. Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0 handle all the things like native. If it can not, dom0 can interpret ACPI and fetch the irq info out and tell hypervior to pass the irq to dom0, it is doable? > > However I don't know if the hypervisor needs to know all the details of an > NVDIMM - or just the starting and ending ranges so that when an guest is created > and the VT-d is constructed - it can be assured that the ranges are valid. > > I am not an expert on the P2M code - but I think that would need to be looked > at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN. We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug, lable support (namespace)...
>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote: > On 01/20/2016 07:20 PM, Jan Beulich wrote: >> To answer this I need to have my understanding of the partitioning >> being done by firmware confirmed: If that's the case, then "normal" >> means the part that doesn't get exposed as a block device (SSD). >> In any event there's no correlation to guest exposure here. > > Firmware does not manage NVDIMM. All the operations of nvdimm are handled > by OS. > > Actually, there are lots of things we should take into account if we move > the NVDIMM management to hypervisor: > a) ACPI NFIT interpretation > A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the > base information of NVDIMM devices which includes PMEM info, PBLK > info, nvdimm device interleave, vendor info, etc. Let me explain it one > by one. > > PMEM and PBLK are two modes to access NVDIMM devices: > 1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address > space so that CPU can r/w it directly. > 2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM > only offers two windows which are mapped to CPU's address space, the data > window and access window, so that CPU can use these two windows to access > the whole NVDIMM device. You fail to mention PBLK. The question above really was about what entity controls which of the two modes get used (and perhaps for which parts of the overall NVDIMM). Jan
On 01/21/2016 01:07 AM, Jan Beulich wrote: >>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote: >> On 01/20/2016 07:20 PM, Jan Beulich wrote: >>> To answer this I need to have my understanding of the partitioning >>> being done by firmware confirmed: If that's the case, then "normal" >>> means the part that doesn't get exposed as a block device (SSD). >>> In any event there's no correlation to guest exposure here. >> >> Firmware does not manage NVDIMM. All the operations of nvdimm are handled >> by OS. >> >> Actually, there are lots of things we should take into account if we move >> the NVDIMM management to hypervisor: >> a) ACPI NFIT interpretation >> A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the >> base information of NVDIMM devices which includes PMEM info, PBLK >> info, nvdimm device interleave, vendor info, etc. Let me explain it one >> by one. >> >> PMEM and PBLK are two modes to access NVDIMM devices: >> 1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address >> space so that CPU can r/w it directly. >> 2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM >> only offers two windows which are mapped to CPU's address space, the data >> window and access window, so that CPU can use these two windows to access >> the whole NVDIMM device. > > You fail to mention PBLK. The question above really was about what The 2) is PBLK. > entity controls which of the two modes get used (and perhaps for > which parts of the overall NVDIMM). So i think the "normal" you mentioned is about PMEM. :)
> >>>>c) hotplug support > >>> > >>>How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS > >>>to scan. That would require the hypervisor also reading this for it to > >>>update it's data-structures. > >> > >>Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface, > >>_FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is > >>the better place handing this case too. > > > >That one is a bit difficult. Both the OS and the hypervisor would need to know about > >this (I think?). dom0 since it gets the ACPI event and needs to process it. Then > >the hypervisor needs to be told so it can slurp it up. > > Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0 Yes of course it can. > handle all the things like native. If it can not, dom0 can interpret ACPI and fetch > the irq info out and tell hypervior to pass the irq to dom0, it is doable? > > > > >However I don't know if the hypervisor needs to know all the details of an > >NVDIMM - or just the starting and ending ranges so that when an guest is created > >and the VT-d is constructed - it can be assured that the ranges are valid. > > > >I am not an expert on the P2M code - but I think that would need to be looked > >at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN. > > We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug, > lable support (namespace)... <hand-waves> I don't know what QEMU does for guests? I naively assumed it would create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have the _DSM). Either way what I think you need to investigate is what is neccessary for the Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address for the NVDIMM. Based on that - you will know what kind of exposure the hypervisor needs to the _FIT and NFIT tables. (Adding Feng Wu, the VT-d maintainer).
On 01/21/2016 01:18 AM, Konrad Rzeszutek Wilk wrote: >>>>>> c) hotplug support >>>>> >>>>> How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS >>>>> to scan. That would require the hypervisor also reading this for it to >>>>> update it's data-structures. >>>> >>>> Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface, >>>> _FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is >>>> the better place handing this case too. >>> >>> That one is a bit difficult. Both the OS and the hypervisor would need to know about >>> this (I think?). dom0 since it gets the ACPI event and needs to process it. Then >>> the hypervisor needs to be told so it can slurp it up. >> >> Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0 > > Yes of course it can. >> handle all the things like native. If it can not, dom0 can interpret ACPI and fetch >> the irq info out and tell hypervior to pass the irq to dom0, it is doable? >> >>> >>> However I don't know if the hypervisor needs to know all the details of an >>> NVDIMM - or just the starting and ending ranges so that when an guest is created >>> and the VT-d is constructed - it can be assured that the ranges are valid. >>> >>> I am not an expert on the P2M code - but I think that would need to be looked >>> at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN. >> >> We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug, >> lable support (namespace)... > > <hand-waves> I don't know what QEMU does for guests? I naively assumed it would > create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have > the _DSM). Ah, ACPI eliminates this E820 entry. > > Either way what I think you need to investigate is what is neccessary for the > Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address for > the NVDIMM. Based on that - you will know what kind of exposure the hypervisor > needs to the _FIT and NFIT tables. > Interesting. I did not consider using NVDIMM as DMA. Do you have usecase for this kind of NVDIMM usage?
On Thu, Jan 21, 2016 at 01:23:31AM +0800, Xiao Guangrong wrote: > > > On 01/21/2016 01:18 AM, Konrad Rzeszutek Wilk wrote: > >>>>>>c) hotplug support > >>>>> > >>>>>How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS > >>>>>to scan. That would require the hypervisor also reading this for it to > >>>>>update it's data-structures. > >>>> > >>>>Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface, > >>>>_FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is > >>>>the better place handing this case too. > >>> > >>>That one is a bit difficult. Both the OS and the hypervisor would need to know about > >>>this (I think?). dom0 since it gets the ACPI event and needs to process it. Then > >>>the hypervisor needs to be told so it can slurp it up. > >> > >>Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0 > > > >Yes of course it can. > >>handle all the things like native. If it can not, dom0 can interpret ACPI and fetch > >>the irq info out and tell hypervior to pass the irq to dom0, it is doable? > >> > >>> > >>>However I don't know if the hypervisor needs to know all the details of an > >>>NVDIMM - or just the starting and ending ranges so that when an guest is created > >>>and the VT-d is constructed - it can be assured that the ranges are valid. > >>> > >>>I am not an expert on the P2M code - but I think that would need to be looked > >>>at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN. > >> > >>We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug, > >>lable support (namespace)... > > > ><hand-waves> I don't know what QEMU does for guests? I naively assumed it would > >create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have > >the _DSM). > > Ah, ACPI eliminates this E820 entry. > > > > >Either way what I think you need to investigate is what is neccessary for the > >Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address for > >the NVDIMM. Based on that - you will know what kind of exposure the hypervisor > >needs to the _FIT and NFIT tables. > > > > Interesting. I did not consider using NVDIMM as DMA. Do you have usecase for this > kind of NVDIMM usage? An easy one is iSCSI target. You could have an SR-IOV NIC that would have TCM enabled (CONFIG_TCM_FILEIO or CONFIG_TCM_IBLOCK). Mount an file on the /dev/pmem0 (using DAX enabled FS) and export it as iSCSI LUN. The traffic would go over an SR-IOV NIC. The DMA transactions would be SR-IOV NIC <-> NVDIMM.
On 20/01/16 15:05, Stefano Stabellini wrote: > On Wed, 20 Jan 2016, Andrew Cooper wrote: >> On 20/01/16 14:29, Stefano Stabellini wrote: >>> On Wed, 20 Jan 2016, Andrew Cooper wrote: >>>> >>> I wouldn't encourage the introduction of anything else that requires >>> root privileges in QEMU. With QEMU running as non-root by default in >>> 4.7, the feature will not be available unless users explicitly ask to >>> run QEMU as root (which they shouldn't really). >> This isn't how design works. >> >> First, design a feature in an architecturally correct way, and then >> design an security policy to fit. >> >> We should not stunt design based on an existing implementation. In >> particular, if design shows that being a root only feature is the only >> sane way of doing this, it should be a root only feature. (I hope this >> is not the case, but it shouldn't cloud the judgement of a design). > I would argue that security is an integral part of the architecture and > should not be retrofitted into it. There is no retrofitting - it is all part of the same overall design before coding starts happen. > > Is it really a good design if the only sane way to implement it is > making it a root-only feature? I think not. Then you have missed the point. If you fail at architecting the feature in the first place, someone else is going to have to come along and reimplement it properly, then provide some form of compatibility with the old one. Security is an important consideration in the design; I do not wish to understate that. However, if the only way for a feature to be architected properly is for the feature to be a root-only feature, then it should be a root-only feature. > Designing security policies > for pieces of software that don't have the infrastructure for them is > costly and that cost should be accounted as part of the overall cost of > the solution rather than added to it in a second stage. That cost is far better spent designing it properly in the first place, rather than having to come along and reimplement a v2 because v1 was broken. > > >> (note, both before implement happens). > That is ideal but realistically in many cases nobody is able to produce > a design before the implementation happens. It is perfectly easy. This is the difference between software engineering and software hacking. There has been a lot of positive feedback from on-list design documents. It is a trend which needs to continue. ~Andrew
On 01/20/16 12:18, Konrad Rzeszutek Wilk wrote: > > >>>>c) hotplug support > > >>> > > >>>How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS > > >>>to scan. That would require the hypervisor also reading this for it to > > >>>update it's data-structures. > > >> > > >>Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface, > > >>_FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is > > >>the better place handing this case too. > > > > > >That one is a bit difficult. Both the OS and the hypervisor would need to know about > > >this (I think?). dom0 since it gets the ACPI event and needs to process it. Then > > >the hypervisor needs to be told so it can slurp it up. > > > > Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0 > > Yes of course it can. > > handle all the things like native. If it can not, dom0 can interpret ACPI and fetch > > the irq info out and tell hypervior to pass the irq to dom0, it is doable? > > > > > > > >However I don't know if the hypervisor needs to know all the details of an > > >NVDIMM - or just the starting and ending ranges so that when an guest is created > > >and the VT-d is constructed - it can be assured that the ranges are valid. > > > > > >I am not an expert on the P2M code - but I think that would need to be looked > > >at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN. > > > > We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug, > > lable support (namespace)... > > <hand-waves> I don't know what QEMU does for guests? I naively assumed it would > create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have > the _DSM). > ACPI 6 defines E820 type 7 for pmem (see table 15-312 in Section 15) and legacy ones may use the non-standard type 12 (and even older ones may use type 6, but linux does not consider type 6 any more), but hot-plugged NVDIMM may not appear in E820. Still think it's better to let dom0 linux that already has enough drivers handle all these device probing tasks. > Either way what I think you need to investigate is what is neccessary for the > Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address for > the NVDIMM. Based on that - you will know what kind of exposure the hypervisor > needs to the _FIT and NFIT tables. > > (Adding Feng Wu, the VT-d maintainer). I haven't considered VT-d at all. From your example in another reply, it looks like that VT-d code needs to be aware of the address space range of NVDIMM, otherwise that example would not work. If so, maybe we can let dom0 linux kernel report the address space ranges of detected NVDIMM devices to Xen hypervisor. Anyway, I'll investigate this issue. Haozhong
On 01/20/2016 11:41 PM, Konrad Rzeszutek Wilk wrote: >>>>> Neither of these are sufficient however. That gets Qemu a mapping of >>>>> the NVDIMM, not the guest. Something, one way or another, has to turn >>>>> this into appropriate add-to-phymap hypercalls. >>>>> >>>> >>>> Yes, those hypercalls are what I'm going to add. >>> >>> Why? >>> >>> What you need (in a rought hand-wave way) is to: >>> - mount /dev/pmem0 >>> - mmap the file on /dev/pmem0 FS >>> - walk the VMA for the file - extract the MFN (machien frame numbers) >> If I understand right, in this case the MFN is the block layout of the DAX-file? If we find all the file blocks, then we get all the MFN. >> Can this step be done by QEMU? Or does linux kernel provide some >> approach for the userspace to do the translation? > The ioctl(fd, FIBMAP, &block) may help, which can get the LBAs that a given file occupies. -Bob > I don't know. I would think no - as you wouldn't want the userspace > application to figure out the physical frames from the virtual > address (unless they are root). But then if you look in > /proc/<pid>/maps and /proc/<pid>/smaps there are some data there. > > Hm, /proc/<pid>/pagemaps has something intersting > > See pagemap_read function. That looks to be doing it? > >> >> Haozhong >> >>> - feed those frame numbers to xc_memory_mapping hypercall. The >>> guest pfns would be contingous. >>> Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on >>> /dev/pmem0 FS - the guest pfns are 0x200000 upward. >>> >>> However the MFNs may be discontingous as the NVDIMM could be an >>> 1TB - and the 8GB file is scattered all over. >>> >>> I believe that is all you would need to do?
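For completeness, a sketch of the FIBMAP route mentioned above (again only an illustration, not code from the series): the ioctl takes a logical block index in an int and replaces it with the block number on the underlying device, with FIGETBSZ giving the filesystem block size; CAP_SYS_RAWIO is required. Two caveats: the result is a block offset within /dev/pmem0, so turning it into a system physical address still needs the region base from the NFIT or E820, and the int-sized result can overflow on very large devices, which is why newer tools prefer FIEMAP.

    /* Hedged sketch: print the on-device block of each logical block of a
     * file living on a /dev/pmem-backed filesystem, via FIBMAP. */
    #include <fcntl.h>
    #include <linux/fs.h>      /* FIBMAP, FIGETBSZ */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct stat st;
        int fd, blksz, i, nblocks;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
            return 1;
        if (fstat(fd, &st) < 0 || ioctl(fd, FIGETBSZ, &blksz) < 0)
            return 1;

        nblocks = (st.st_size + blksz - 1) / blksz;
        for (i = 0; i < nblocks; i++) {
            int blk = i;                      /* in: logical block index   */
            if (ioctl(fd, FIBMAP, &blk) < 0)  /* out: device block number  */
                return 1;
            printf("logical %d -> device block %d (block size %d)\n",
                   i, blk, blksz);
        }
        close(fd);
        return 0;
    }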
>>> On 20.01.16 at 18:17, <guangrong.xiao@linux.intel.com> wrote: > > On 01/21/2016 01:07 AM, Jan Beulich wrote: >>>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote: >>> On 01/20/2016 07:20 PM, Jan Beulich wrote: >>>> To answer this I need to have my understanding of the partitioning >>>> being done by firmware confirmed: If that's the case, then "normal" >>>> means the part that doesn't get exposed as a block device (SSD). >>>> In any event there's no correlation to guest exposure here. >>> >>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled >>> by OS. >>> >>> Actually, there are lots of things we should take into account if we move >>> the NVDIMM management to hypervisor: >>> a) ACPI NFIT interpretation >>> A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the >>> base information of NVDIMM devices which includes PMEM info, PBLK >>> info, nvdimm device interleave, vendor info, etc. Let me explain it one >>> by one. >>> >>> PMEM and PBLK are two modes to access NVDIMM devices: >>> 1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address >>> space so that CPU can r/w it directly. >>> 2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM >>> only offers two windows which are mapped to CPU's address space, the data >>> window and access window, so that CPU can use these two windows to access >>> the whole NVDIMM device. >> >> You fail to mention PBLK. The question above really was about what > > The 2) is PBLK. > >> entity controls which of the two modes get used (and perhaps for >> which parts of the overall NVDIMM). > > So i think the "normal" you mentioned is about PMEM. :) Yes. But then - other than you said above - it still looks to me as if the split between PMEM and PBLK is arranged for by firmware? Jan
On 01/21/2016 04:18 PM, Jan Beulich wrote: >>>> On 20.01.16 at 18:17, <guangrong.xiao@linux.intel.com> wrote: > >> >> On 01/21/2016 01:07 AM, Jan Beulich wrote: >>>>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote: >>>> On 01/20/2016 07:20 PM, Jan Beulich wrote: >>>>> To answer this I need to have my understanding of the partitioning >>>>> being done by firmware confirmed: If that's the case, then "normal" >>>>> means the part that doesn't get exposed as a block device (SSD). >>>>> In any event there's no correlation to guest exposure here. >>>> >>>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled >>>> by OS. >>>> >>>> Actually, there are lots of things we should take into account if we move >>>> the NVDIMM management to hypervisor: >>>> a) ACPI NFIT interpretation >>>> A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the >>>> base information of NVDIMM devices which includes PMEM info, PBLK >>>> info, nvdimm device interleave, vendor info, etc. Let me explain it one >>>> by one. >>>> >>>> PMEM and PBLK are two modes to access NVDIMM devices: >>>> 1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address >>>> space so that CPU can r/w it directly. >>>> 2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM >>>> only offers two windows which are mapped to CPU's address space, the data >>>> window and access window, so that CPU can use these two windows to access >>>> the whole NVDIMM device. >>> >>> You fail to mention PBLK. The question above really was about what >> >> The 2) is PBLK. >> >>> entity controls which of the two modes get used (and perhaps for >>> which parts of the overall NVDIMM). >> >> So i think the "normal" you mentioned is about PMEM. :) > > Yes. But then - other than you said above - it still looks to me as > if the split between PMEM and PBLK is arranged for by firmware? Yes. But OS/Hypervisor is not excepted to dynamically change its configure (re-split), i,e, for PoV of OS/Hypervisor, it is static.
>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote: > On 01/21/2016 04:18 PM, Jan Beulich wrote: >> Yes. But then - other than you said above - it still looks to me as >> if the split between PMEM and PBLK is arranged for by firmware? > > Yes. But OS/Hypervisor is not excepted to dynamically change its configure > (re-split), > i,e, for PoV of OS/Hypervisor, it is static. Exactly, that has been my understanding. And hence the PMEM part could be under the hypervisor's control, while the PBLK part could be Dom0's responsibility. Jan
On 01/21/2016 04:53 PM, Jan Beulich wrote:
>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
>> On 01/21/2016 04:18 PM, Jan Beulich wrote:
>>> Yes. But then - other than you said above - it still looks to me as
>>> if the split between PMEM and PBLK is arranged for by firmware?
>>
>> Yes. But OS/Hypervisor is not excepted to dynamically change its configure
>> (re-split),
>> i,e, for PoV of OS/Hypervisor, it is static.
>
> Exactly, that has been my understanding. And hence the PMEM part
> could be under the hypervisor's control, while the PBLK part could be
> Dom0's responsibility.
>

I am not sure if I have understood your point. Is your suggestion to leave
PMEM to the hypervisor and all the other parts (PBLK and _DSM handling) to
Dom0? If yes, we should:
a) handle hotplug in the hypervisor (new PMEM add/remove), which requires
   the hypervisor to interpret ACPI SSDT/DSDT.
b) filter out the _DSMs that control PMEM and handle them in the hypervisor.
c) have the hypervisor manage the PMEM resource pool and partition it to
   multiple VMs.
On 21/01/16 09:10, Xiao Guangrong wrote: > > > On 01/21/2016 04:53 PM, Jan Beulich wrote: >>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote: >>> On 01/21/2016 04:18 PM, Jan Beulich wrote: >>>> Yes. But then - other than you said above - it still looks to me as >>>> if the split between PMEM and PBLK is arranged for by firmware? >>> >>> Yes. But OS/Hypervisor is not excepted to dynamically change its >>> configure >>> (re-split), >>> i,e, for PoV of OS/Hypervisor, it is static. >> >> Exactly, that has been my understanding. And hence the PMEM part >> could be under the hypervisor's control, while the PBLK part could be >> Dom0's responsibility. >> > > I am not sure if i have understood your point. What your suggestion is > that > leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to > Dom0? If yes, we should: > a) handle hotplug in hypervisor (new PMEM add/remove) that causes > hyperivsor > interpret ACPI SSDT/DSDT. > b) some _DSMs control PMEM so you should filter out these kind of > _DSMs and > handle them in hypervisor. > c) hypervisor should mange PMEM resource pool and partition it to > multiple > VMs. It is not possible for Xen to handle ACPI such as this. There can only be one OSPM on a system, and 9/10ths of the functionality needing it already lives in Dom0. The only rational course of action is for Xen to treat both PBLK and PMEM as "devices" and leave them in Dom0's hands. ~Andrew
>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote: > On 01/21/2016 04:53 PM, Jan Beulich wrote: >>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote: >>> On 01/21/2016 04:18 PM, Jan Beulich wrote: >>>> Yes. But then - other than you said above - it still looks to me as >>>> if the split between PMEM and PBLK is arranged for by firmware? >>> >>> Yes. But OS/Hypervisor is not excepted to dynamically change its configure >>> (re-split), >>> i,e, for PoV of OS/Hypervisor, it is static. >> >> Exactly, that has been my understanding. And hence the PMEM part >> could be under the hypervisor's control, while the PBLK part could be >> Dom0's responsibility. >> > > I am not sure if i have understood your point. What your suggestion is that > leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to > Dom0? If yes, we should: > a) handle hotplug in hypervisor (new PMEM add/remove) that causes hyperivsor > interpret ACPI SSDT/DSDT. Why would this be different from ordinary memory hotplug, where Dom0 deals with the ACPI CA interaction, notifying Xen about the added memory? > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and > handle them in hypervisor. Not if (see above) following the model we currently have in place. > c) hypervisor should mange PMEM resource pool and partition it to multiple > VMs. Yes. Jan
>>> On 21.01.16 at 10:29, <andrew.cooper3@citrix.com> wrote: > On 21/01/16 09:10, Xiao Guangrong wrote: >> I am not sure if i have understood your point. What your suggestion is >> that >> leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to >> Dom0? If yes, we should: >> a) handle hotplug in hypervisor (new PMEM add/remove) that causes >> hyperivsor >> interpret ACPI SSDT/DSDT. >> b) some _DSMs control PMEM so you should filter out these kind of >> _DSMs and >> handle them in hypervisor. >> c) hypervisor should mange PMEM resource pool and partition it to >> multiple >> VMs. > > It is not possible for Xen to handle ACPI such as this. > > There can only be one OSPM on a system, and 9/10ths of the functionality > needing it already lives in Dom0. > > The only rational course of action is for Xen to treat both PBLK and > PMEM as "devices" and leave them in Dom0's hands. See my other reply: Why would this be different from "ordinary" memory hotplug? Jan
On 01/21/16 03:25, Jan Beulich wrote:
> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> > On 01/21/2016 04:53 PM, Jan Beulich wrote:
> >>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
> >>> On 01/21/2016 04:18 PM, Jan Beulich wrote:
> >>>> Yes. But then - other than you said above - it still looks to me as
> >>>> if the split between PMEM and PBLK is arranged for by firmware?
> >>>
> >>> Yes. But OS/Hypervisor is not excepted to dynamically change its configure
> >>> (re-split),
> >>> i,e, for PoV of OS/Hypervisor, it is static.
> >>
> >> Exactly, that has been my understanding. And hence the PMEM part
> >> could be under the hypervisor's control, while the PBLK part could be
> >> Dom0's responsibility.
> >>
> >
> > I am not sure if i have understood your point. What your suggestion is that
> > leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to
> > Dom0? If yes, we should:
> > a) handle hotplug in hypervisor (new PMEM add/remove) that causes hyperivsor
> > interpret ACPI SSDT/DSDT.
>
> Why would this be different from ordinary memory hotplug, where
> Dom0 deals with the ACPI CA interaction, notifying Xen about the
> added memory?
>

The process of NVDIMM hotplug is similar to ordinary memory hotplug,
so it seems possible to support it in the Xen hypervisor in the same
way.

> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and
> > handle them in hypervisor.
>
> Not if (see above) following the model we currently have in place.
>

You mean letting dom0 Linux evaluate those _DSMs and interact with the
hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)?

> > c) hypervisor should mange PMEM resource pool and partition it to multiple
> > VMs.
>
> Yes.
>

But I still do not quite understand this part: why must pmem resource
management and partitioning be done in the hypervisor?

I mean, if we allow the following steps of operations (for example):
(1) partition pmem in dom0
(2) get the address and size of each partition (part_addr, part_size)
(3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size,
    gpfn) to map a partition to the address gpfn in dom d.

Only the last step requires the hypervisor. Would anything be wrong if
we allowed the above operations?

Haozhong
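To make step (3) concrete, here is a rough toolstack-side sketch. The nvdimm_memory_mapping() hypercall proposed above does not exist; the sketch simply reuses the existing libxc call xc_domain_memory_mapping() (today used for MMIO pass-through) to show what information would cross the toolstack/hypervisor boundary, so treat it as an illustration of the proposed flow rather than code from the series:

    /* Illustration only: map a host pmem partition [part_addr, part_addr +
     * part_size) into guest domid starting at guest frame gpfn. */
    #include <xenctrl.h>

    static int map_pmem_partition(xc_interface *xch, uint32_t domid,
                                  uint64_t part_addr, uint64_t part_size,
                                  unsigned long gpfn)
    {
        unsigned long first_mfn = part_addr >> XC_PAGE_SHIFT;
        unsigned long nr_mfns   = part_size >> XC_PAGE_SHIFT;

        /* The hypervisor would still need to verify that this range really
         * lies within a host PMEM region before inserting it into the p2m. */
        return xc_domain_memory_mapping(xch, domid, gpfn, first_mfn, nr_mfns,
                                        DPCI_ADD_MAPPING);
    }

Note that xc_domain_memory_mapping() inserts the range as direct MMIO; whether that, a new p2m type, or a dedicated hypercall is the right representation for pmem is exactly the design question being debated in this thread.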
>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote: > On 01/21/16 03:25, Jan Beulich wrote: >> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote: >> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and >> > handle them in hypervisor. >> >> Not if (see above) following the model we currently have in place. >> > > You mean let dom0 linux evaluates those _DSMs and interact with > hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)? Yes. >> > c) hypervisor should mange PMEM resource pool and partition it to multiple >> > VMs. >> >> Yes. >> > > But I Still do not quite understand this part: why must pmem resource > management and partition be done in hypervisor? Because that's where memory management belongs. And PMEM, other than PBLK, is just another form of RAM. > I mean if we allow the following steps of operations (for example) > (1) partition pmem in dom 0 > (2) get address and size of each partition (part_addr, part_size) > (3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size, > gpfn) to > map a partition to the address gpfn in dom d. > Only the last step requires hypervisor. Would anything be wrong if we > allow above operations? The main issue is that this would imo be a layering violation. I'm sure it can be made work, but that doesn't mean that's the way it ought to work. Jan
On 01/21/16 07:52, Jan Beulich wrote: > >>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote: > > On 01/21/16 03:25, Jan Beulich wrote: > >> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote: > >> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and > >> > handle them in hypervisor. > >> > >> Not if (see above) following the model we currently have in place. > >> > > > > You mean let dom0 linux evaluates those _DSMs and interact with > > hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)? > > Yes. > > >> > c) hypervisor should mange PMEM resource pool and partition it to multiple > >> > VMs. > >> > >> Yes. > >> > > > > But I Still do not quite understand this part: why must pmem resource > > management and partition be done in hypervisor? > > Because that's where memory management belongs. And PMEM, > other than PBLK, is just another form of RAM. > > > I mean if we allow the following steps of operations (for example) > > (1) partition pmem in dom 0 > > (2) get address and size of each partition (part_addr, part_size) > > (3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size, > > gpfn) to > > map a partition to the address gpfn in dom d. > > Only the last step requires hypervisor. Would anything be wrong if we > > allow above operations? > > The main issue is that this would imo be a layering violation. I'm > sure it can be made work, but that doesn't mean that's the way > it ought to work. > > Jan > OK, then it makes sense to put them in hypervisor. I'll think about this and note in the design document. Thanks, Haozhong
On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote: >>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote: >> On 01/21/16 03:25, Jan Beulich wrote: >>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote: >>> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and >>> > handle them in hypervisor. >>> >>> Not if (see above) following the model we currently have in place. >>> >> >> You mean let dom0 linux evaluates those _DSMs and interact with >> hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)? > > Yes. > >>> > c) hypervisor should mange PMEM resource pool and partition it to multiple >>> > VMs. >>> >>> Yes. >>> >> >> But I Still do not quite understand this part: why must pmem resource >> management and partition be done in hypervisor? > > Because that's where memory management belongs. And PMEM, > other than PBLK, is just another form of RAM. I haven't looked more deeply into the details of this, but this argument doesn't seem right to me. Normal RAM in Xen is what might be called "fungible" -- at boot, all RAM is zeroed, and it basically doesn't matter at all what RAM is given to what guest. (There are restrictions of course: lowmem for DMA, contiguous superpages, &c; but within those groups, it doesn't matter *which* bit of lowmem you get, as long as you get enough to do your job.) If you reboot your guest or hand RAM back to the hypervisor, you assume that everything in it will disappear. When you ask for RAM, you can request some parameters that it will have (lowmem, on a specific node, &c), but you can't request a specific page that you had before. This is not the case for PMEM. The whole point of PMEM (correct me if I'm wrong) is to be used for long-term storage that survives over reboot. It matters very much that a guest be given the same PRAM after the host is rebooted that it was given before. It doesn't make any sense to manage it the way Xen currently manages RAM (i.e., that you request a page and get whatever Xen happens to give you). So if Xen is going to use PMEM, it will have to invent an entirely new interface for guests, and it will have to keep track of those resources across host reboots. In other words, it will have to duplicate all the work that Linux already does. What do we gain from that duplication? Why not just leverage what's already implemented in dom0? >> I mean if we allow the following steps of operations (for example) >> (1) partition pmem in dom 0 >> (2) get address and size of each partition (part_addr, part_size) >> (3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size, >> gpfn) to >> map a partition to the address gpfn in dom d. >> Only the last step requires hypervisor. Would anything be wrong if we >> allow above operations? > > The main issue is that this would imo be a layering violation. I'm > sure it can be made work, but that doesn't mean that's the way > it ought to work. Jan, from a toolstack <-> Xen perspective, I'm not sure what alternative there to the interface above. Won't the toolstack have to 1) figure out what nvdimm regions there are and 2) tell Xen how and where to assign them to the guest no matter what we do? And if we want to assign arbitrary regions to arbitrary guests, then (part_addr, part_size) and (gpfn) are going to be necessary bits of information. The only difference would be whether part_addr is the machine address or some abstracted address space (possibly starting at 0). 
What does your ideal toolstack <-> Xen interface look like? -George
>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote: > On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote: >>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote: >>> On 01/21/16 03:25, Jan Beulich wrote: >>>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote: >>>> > c) hypervisor should mange PMEM resource pool and partition it to multiple >>>> > VMs. >>>> >>>> Yes. >>>> >>> >>> But I Still do not quite understand this part: why must pmem resource >>> management and partition be done in hypervisor? >> >> Because that's where memory management belongs. And PMEM, >> other than PBLK, is just another form of RAM. > > I haven't looked more deeply into the details of this, but this > argument doesn't seem right to me. > > Normal RAM in Xen is what might be called "fungible" -- at boot, all > RAM is zeroed, and it basically doesn't matter at all what RAM is > given to what guest. (There are restrictions of course: lowmem for > DMA, contiguous superpages, &c; but within those groups, it doesn't > matter *which* bit of lowmem you get, as long as you get enough to do > your job.) If you reboot your guest or hand RAM back to the > hypervisor, you assume that everything in it will disappear. When you > ask for RAM, you can request some parameters that it will have > (lowmem, on a specific node, &c), but you can't request a specific > page that you had before. > > This is not the case for PMEM. The whole point of PMEM (correct me if > I'm wrong) is to be used for long-term storage that survives over > reboot. It matters very much that a guest be given the same PRAM > after the host is rebooted that it was given before. It doesn't make > any sense to manage it the way Xen currently manages RAM (i.e., that > you request a page and get whatever Xen happens to give you). Interesting. This isn't the usage model I have been thinking about so far. Having just gone back to the original 0/4 mail, I'm afraid we're really left guessing, and you guessed differently than I did. My understanding of the intentions of PMEM so far was that this is a high-capacity, slower than DRAM but much faster than e.g. swapping to disk alternative to normal RAM. I.e. the persistent aspect of it wouldn't matter at all in this case (other than for PBLK, obviously). However, thinking through your usage model I have problems seeing it work in a reasonable way even with virtualization left aside: To my knowledge there's no established protocol on how multiple parties (different versions of the same OS, or even completely different OSes) would arbitrate using such memory ranges. And even for a single OS it is, other than for disks (and hence PBLK), not immediately clear how it would communicate from one boot to another what information got stored where, or how it would react to some or all of this storage having disappeared (just like a disk which got removed, which - unless it held the boot partition - would normally have pretty little effect on the OS coming back up). > So if Xen is going to use PMEM, it will have to invent an entirely new > interface for guests, and it will have to keep track of those > resources across host reboots. In other words, it will have to > duplicate all the work that Linux already does. What do we gain from > that duplication? Why not just leverage what's already implemented in > dom0? Indeed if my guessing on the intentions was wrong, then the picture completely changes (also for the points you've made further down). Jan
On 26/01/16 13:44, Jan Beulich wrote: >>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote: >> On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote: >>>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote: >>>> On 01/21/16 03:25, Jan Beulich wrote: >>>>>>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote: >>>>>> c) hypervisor should mange PMEM resource pool and partition it to multiple >>>>>> VMs. >>>>> >>>>> Yes. >>>>> >>>> >>>> But I Still do not quite understand this part: why must pmem resource >>>> management and partition be done in hypervisor? >>> >>> Because that's where memory management belongs. And PMEM, >>> other than PBLK, is just another form of RAM. >> >> I haven't looked more deeply into the details of this, but this >> argument doesn't seem right to me. >> >> Normal RAM in Xen is what might be called "fungible" -- at boot, all >> RAM is zeroed, and it basically doesn't matter at all what RAM is >> given to what guest. (There are restrictions of course: lowmem for >> DMA, contiguous superpages, &c; but within those groups, it doesn't >> matter *which* bit of lowmem you get, as long as you get enough to do >> your job.) If you reboot your guest or hand RAM back to the >> hypervisor, you assume that everything in it will disappear. When you >> ask for RAM, you can request some parameters that it will have >> (lowmem, on a specific node, &c), but you can't request a specific >> page that you had before. >> >> This is not the case for PMEM. The whole point of PMEM (correct me if >> I'm wrong) is to be used for long-term storage that survives over >> reboot. It matters very much that a guest be given the same PRAM >> after the host is rebooted that it was given before. It doesn't make >> any sense to manage it the way Xen currently manages RAM (i.e., that >> you request a page and get whatever Xen happens to give you). > > Interesting. This isn't the usage model I have been thinking about > so far. Having just gone back to the original 0/4 mail, I'm afraid > we're really left guessing, and you guessed differently than I did. > My understanding of the intentions of PMEM so far was that this > is a high-capacity, slower than DRAM but much faster than e.g. > swapping to disk alternative to normal RAM. I.e. the persistent > aspect of it wouldn't matter at all in this case (other than for PBLK, > obviously). > > However, thinking through your usage model I have problems > seeing it work in a reasonable way even with virtualization left > aside: To my knowledge there's no established protocol on how > multiple parties (different versions of the same OS, or even > completely different OSes) would arbitrate using such memory > ranges. And even for a single OS it is, other than for disks (and > hence PBLK), not immediately clear how it would communicate > from one boot to another what information got stored where, > or how it would react to some or all of this storage having > disappeared (just like a disk which got removed, which - unless > it held the boot partition - would normally have pretty little > effect on the OS coming back up). Last year at Linux Plumbers Conference I attended a session dedicated to NVDIMM support. I asked the very same question and the INTEL guy there told me there is indeed something like a partition table meant to describe the layout of the memory areas and their contents. It would be nice to have a pointer to such information. 
Without anything like this it might be rather difficult to find the best solution how to implement NVDIMM support in Xen or any other product. Juergen
On 26/01/16 12:44, Jan Beulich wrote: >>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote: >> On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote: >>>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote: >>>> On 01/21/16 03:25, Jan Beulich wrote: >>>>>>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote: >>>>>> c) hypervisor should mange PMEM resource pool and partition it to multiple >>>>>> VMs. >>>>> >>>>> Yes. >>>>> >>>> >>>> But I Still do not quite understand this part: why must pmem resource >>>> management and partition be done in hypervisor? >>> >>> Because that's where memory management belongs. And PMEM, >>> other than PBLK, is just another form of RAM. >> >> I haven't looked more deeply into the details of this, but this >> argument doesn't seem right to me. >> >> Normal RAM in Xen is what might be called "fungible" -- at boot, all >> RAM is zeroed, and it basically doesn't matter at all what RAM is >> given to what guest. (There are restrictions of course: lowmem for >> DMA, contiguous superpages, &c; but within those groups, it doesn't >> matter *which* bit of lowmem you get, as long as you get enough to do >> your job.) If you reboot your guest or hand RAM back to the >> hypervisor, you assume that everything in it will disappear. When you >> ask for RAM, you can request some parameters that it will have >> (lowmem, on a specific node, &c), but you can't request a specific >> page that you had before. >> >> This is not the case for PMEM. The whole point of PMEM (correct me if >> I'm wrong) is to be used for long-term storage that survives over >> reboot. It matters very much that a guest be given the same PRAM >> after the host is rebooted that it was given before. It doesn't make >> any sense to manage it the way Xen currently manages RAM (i.e., that >> you request a page and get whatever Xen happens to give you). > > Interesting. This isn't the usage model I have been thinking about > so far. Having just gone back to the original 0/4 mail, I'm afraid > we're really left guessing, and you guessed differently than I did. > My understanding of the intentions of PMEM so far was that this > is a high-capacity, slower than DRAM but much faster than e.g. > swapping to disk alternative to normal RAM. I.e. the persistent > aspect of it wouldn't matter at all in this case (other than for PBLK, > obviously). Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM", then you're right -- it is just another form of RAM, that should be treated no differently than say, lowmem: a fungible resource that can be requested by setting a flag. Haozhong? -George
> Last year at Linux Plumbers Conference I attended a session dedicated
> to NVDIMM support. I asked the very same question and the INTEL guy
> there told me there is indeed something like a partition table meant
> to describe the layout of the memory areas and their contents.

It is described in detail at pmem.io; look at Documents, see
http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf, Namespaces section.

Then I would recommend you read:
http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

followed by
http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

And then for dessert:
https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt

which explains it in more technical terms.

>
> It would be nice to have a pointer to such information. Without anything
> like this it might be rather difficult to find the best solution how to
> implement NVDIMM support in Xen or any other product.
On Tue, Jan 26, 2016 at 01:58:35PM +0000, George Dunlap wrote: > On 26/01/16 12:44, Jan Beulich wrote: > >>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote: > >> On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote: > >>>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote: > >>>> On 01/21/16 03:25, Jan Beulich wrote: > >>>>>>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote: > >>>>>> c) hypervisor should mange PMEM resource pool and partition it to multiple > >>>>>> VMs. > >>>>> > >>>>> Yes. > >>>>> > >>>> > >>>> But I Still do not quite understand this part: why must pmem resource > >>>> management and partition be done in hypervisor? > >>> > >>> Because that's where memory management belongs. And PMEM, > >>> other than PBLK, is just another form of RAM. > >> > >> I haven't looked more deeply into the details of this, but this > >> argument doesn't seem right to me. > >> > >> Normal RAM in Xen is what might be called "fungible" -- at boot, all > >> RAM is zeroed, and it basically doesn't matter at all what RAM is > >> given to what guest. (There are restrictions of course: lowmem for > >> DMA, contiguous superpages, &c; but within those groups, it doesn't > >> matter *which* bit of lowmem you get, as long as you get enough to do > >> your job.) If you reboot your guest or hand RAM back to the > >> hypervisor, you assume that everything in it will disappear. When you > >> ask for RAM, you can request some parameters that it will have > >> (lowmem, on a specific node, &c), but you can't request a specific > >> page that you had before. > >> > >> This is not the case for PMEM. The whole point of PMEM (correct me if > >> I'm wrong) is to be used for long-term storage that survives over > >> reboot. It matters very much that a guest be given the same PRAM > >> after the host is rebooted that it was given before. It doesn't make > >> any sense to manage it the way Xen currently manages RAM (i.e., that > >> you request a page and get whatever Xen happens to give you). > > > > Interesting. This isn't the usage model I have been thinking about > > so far. Having just gone back to the original 0/4 mail, I'm afraid > > we're really left guessing, and you guessed differently than I did. > > My understanding of the intentions of PMEM so far was that this > > is a high-capacity, slower than DRAM but much faster than e.g. > > swapping to disk alternative to normal RAM. I.e. the persistent > > aspect of it wouldn't matter at all in this case (other than for PBLK, > > obviously). > > Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM", > then you're right -- it is just another form of RAM, that should be > treated no differently than say, lowmem: a fungible resource that can be > requested by setting a flag. I would think of it as MMIO ranges than RAM. Yes it is behind an MMC - but there are subtle things such as the new instructions - pcommit, clfushopt, and other that impact it. Furthermore ranges (contingous and most likely discontingous) of this "RAM" has to be shared with guests (at least dom0) and with other (multiple HVM guests). > > Haozhong? > > -George > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
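As a small illustration of the point about new instructions: software writing to a DAX-mapped pmem range has to flush its stores explicitly to make them durable. A minimal sketch using the baseline CLFLUSH/SFENCE pair follows; optimized code (e.g. libpmem) would prefer CLFLUSHOPT/CLWB and, on the parts being discussed in this thread, follow up with PCOMMIT, but those variants are omitted to keep the sketch to instructions available everywhere:

    /* Simplified sketch: make a small write to a DAX-mapped pmem file
     * durable using CLFLUSH + SFENCE. */
    #include <emmintrin.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHELINE 64

    static void pmem_store(void *dst, const void *src, size_t len)
    {
        uintptr_t p;

        memcpy(dst, src, len);

        /* Flush every cache line touched by the write back to the DIMM. */
        for ( p = (uintptr_t)dst & ~((uintptr_t)CACHELINE - 1);
              p < (uintptr_t)dst + len;
              p += CACHELINE )
            _mm_clflush((void *)p);

        _mm_sfence();   /* order the flushes before whatever follows */
    }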
On 01/26/16 05:44, Jan Beulich wrote: > >>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote: > > On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote: > >>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote: > >>> On 01/21/16 03:25, Jan Beulich wrote: > >>>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote: > >>>> > c) hypervisor should mange PMEM resource pool and partition it to multiple > >>>> > VMs. > >>>> > >>>> Yes. > >>>> > >>> > >>> But I Still do not quite understand this part: why must pmem resource > >>> management and partition be done in hypervisor? > >> > >> Because that's where memory management belongs. And PMEM, > >> other than PBLK, is just another form of RAM. > > > > I haven't looked more deeply into the details of this, but this > > argument doesn't seem right to me. > > > > Normal RAM in Xen is what might be called "fungible" -- at boot, all > > RAM is zeroed, and it basically doesn't matter at all what RAM is > > given to what guest. (There are restrictions of course: lowmem for > > DMA, contiguous superpages, &c; but within those groups, it doesn't > > matter *which* bit of lowmem you get, as long as you get enough to do > > your job.) If you reboot your guest or hand RAM back to the > > hypervisor, you assume that everything in it will disappear. When you > > ask for RAM, you can request some parameters that it will have > > (lowmem, on a specific node, &c), but you can't request a specific > > page that you had before. > > > > This is not the case for PMEM. The whole point of PMEM (correct me if > > I'm wrong) is to be used for long-term storage that survives over > > reboot. It matters very much that a guest be given the same PRAM > > after the host is rebooted that it was given before. It doesn't make > > any sense to manage it the way Xen currently manages RAM (i.e., that > > you request a page and get whatever Xen happens to give you). > > Interesting. This isn't the usage model I have been thinking about > so far. Having just gone back to the original 0/4 mail, I'm afraid > we're really left guessing, and you guessed differently than I did. > My understanding of the intentions of PMEM so far was that this > is a high-capacity, slower than DRAM but much faster than e.g. > swapping to disk alternative to normal RAM. I.e. the persistent > aspect of it wouldn't matter at all in this case (other than for PBLK, > obviously). > Of course, pmem could be used in the way you thought because of its 'ram' aspect. But I think the more meaningful usage is from its persistent aspect. For example, the implementation of some journal file systems could store logs in pmem rather than the normal ram, so that if a power failure happens before those in-memory logs are completely written to the disk, there would still be chance to restore them from pmem after next booting (rather than abandoning all of them). (I'm still writing the design doc which will include more details of underlying hardware and the software interface of nvdimm exposed by current linux) > However, thinking through your usage model I have problems > seeing it work in a reasonable way even with virtualization left > aside: To my knowledge there's no established protocol on how > multiple parties (different versions of the same OS, or even > completely different OSes) would arbitrate using such memory > ranges. 
And even for a single OS it is, other than for disks (and > hence PBLK), not immediately clear how it would communicate > from one boot to another what information got stored where, > or how it would react to some or all of this storage having > disappeared (just like a disk which got removed, which - unless > it held the boot partition - would normally have pretty little > effect on the OS coming back up). > Label storage area is a persistent area on NVDIMM and can be used to store partitions information. It's not included in pmem (that part that is mapped into the system address space). Instead, it can be only accessed through NVDIMM _DSM method [1]. However, what contents are stored and how they are interpreted are left to software. One way is to follow NVDIMM Namespace Specification [2] to store an array of labels that describe the start address (from the base 0 of pmem) and the size of each partition, which is called as namespace. On Linux, each namespace is exposed as a /dev/pmemXX device. In the virtualization, the (virtual) label storage area of vNVDIMM and the corresponding _DSM method are emulated by QEMU. The virtual label storage area is not written to the host one. Instead, we can reserve a piece area on pmem for the virtual one. Besides namespaces, we can also create DAX file systems on pmem and use files to partition. Haozhong > > So if Xen is going to use PMEM, it will have to invent an entirely new > > interface for guests, and it will have to keep track of those > > resources across host reboots. In other words, it will have to > > duplicate all the work that Linux already does. What do we gain from > > that duplication? Why not just leverage what's already implemented in > > dom0? > > Indeed if my guessing on the intentions was wrong, then the > picture completely changes (also for the points you've made > further down). > > Jan >
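For readers unfamiliar with the label format referred to above, the essential content of a namespace label is roughly the following. This is an abridged rendering of the layout in the NVDIMM Namespace Specification (the same layout Linux's drivers/nvdimm/label.h implements); consult the specification referenced in the follow-up mail for the authoritative definition:

    /* Abridged namespace label.  Values are little-endian in the label
     * storage area; plain integer types are used here for brevity. */
    #include <stdint.h>

    struct namespace_label {
        uint8_t  uuid[16];     /* identifies the namespace across reboots    */
        uint8_t  name[64];     /* optional human-readable name               */
        uint32_t flags;        /* e.g. read-only, updating, local            */
        uint16_t nlabel;       /* number of labels making up this namespace  */
        uint16_t position;     /* position of this label within the set      */
        uint64_t isetcookie;   /* interleave-set cookie it belongs to        */
        uint64_t lbasize;      /* logical block size (block namespaces)      */
        uint64_t dpa;          /* start DIMM-physical address of the extent  */
        uint64_t rawsize;      /* size of the extent in bytes                */
        uint32_t slot;         /* slot in the label storage area             */
        uint32_t unused;
    };

The dpa/rawsize pair is what gives the label area its partition-table role: any OS that reads the label storage area through _DSM can rediscover where each namespace starts and how large it is.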
On 01/26/16 23:30, Haozhong Zhang wrote: > On 01/26/16 05:44, Jan Beulich wrote: > > >>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote: > > > On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote: > > >>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote: > > >>> On 01/21/16 03:25, Jan Beulich wrote: > > >>>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote: > > >>>> > c) hypervisor should mange PMEM resource pool and partition it to multiple > > >>>> > VMs. > > >>>> > > >>>> Yes. > > >>>> > > >>> > > >>> But I Still do not quite understand this part: why must pmem resource > > >>> management and partition be done in hypervisor? > > >> > > >> Because that's where memory management belongs. And PMEM, > > >> other than PBLK, is just another form of RAM. > > > > > > I haven't looked more deeply into the details of this, but this > > > argument doesn't seem right to me. > > > > > > Normal RAM in Xen is what might be called "fungible" -- at boot, all > > > RAM is zeroed, and it basically doesn't matter at all what RAM is > > > given to what guest. (There are restrictions of course: lowmem for > > > DMA, contiguous superpages, &c; but within those groups, it doesn't > > > matter *which* bit of lowmem you get, as long as you get enough to do > > > your job.) If you reboot your guest or hand RAM back to the > > > hypervisor, you assume that everything in it will disappear. When you > > > ask for RAM, you can request some parameters that it will have > > > (lowmem, on a specific node, &c), but you can't request a specific > > > page that you had before. > > > > > > This is not the case for PMEM. The whole point of PMEM (correct me if > > > I'm wrong) is to be used for long-term storage that survives over > > > reboot. It matters very much that a guest be given the same PRAM > > > after the host is rebooted that it was given before. It doesn't make > > > any sense to manage it the way Xen currently manages RAM (i.e., that > > > you request a page and get whatever Xen happens to give you). > > > > Interesting. This isn't the usage model I have been thinking about > > so far. Having just gone back to the original 0/4 mail, I'm afraid > > we're really left guessing, and you guessed differently than I did. > > My understanding of the intentions of PMEM so far was that this > > is a high-capacity, slower than DRAM but much faster than e.g. > > swapping to disk alternative to normal RAM. I.e. the persistent > > aspect of it wouldn't matter at all in this case (other than for PBLK, > > obviously). > > > > Of course, pmem could be used in the way you thought because of its > 'ram' aspect. But I think the more meaningful usage is from its > persistent aspect. For example, the implementation of some journal > file systems could store logs in pmem rather than the normal ram, so > that if a power failure happens before those in-memory logs are > completely written to the disk, there would still be chance to restore > them from pmem after next booting (rather than abandoning all of > them). 
> > (I'm still writing the design doc which will include more details of > underlying hardware and the software interface of nvdimm exposed by > current linux) > > > However, thinking through your usage model I have problems > > seeing it work in a reasonable way even with virtualization left > > aside: To my knowledge there's no established protocol on how > > multiple parties (different versions of the same OS, or even > > completely different OSes) would arbitrate using such memory > > ranges. And even for a single OS it is, other than for disks (and > > hence PBLK), not immediately clear how it would communicate > > from one boot to another what information got stored where, > > or how it would react to some or all of this storage having > > disappeared (just like a disk which got removed, which - unless > > it held the boot partition - would normally have pretty little > > effect on the OS coming back up). > > > > Label storage area is a persistent area on NVDIMM and can be used to > store partitions information. It's not included in pmem (that part > that is mapped into the system address space). Instead, it can be only > accessed through NVDIMM _DSM method [1]. However, what contents are > stored and how they are interpreted are left to software. One way is > to follow NVDIMM Namespace Specification [2] to store an array of > labels that describe the start address (from the base 0 of pmem) and > the size of each partition, which is called as namespace. On Linux, > each namespace is exposed as a /dev/pmemXX device. > > In the virtualization, the (virtual) label storage area of vNVDIMM and > the corresponding _DSM method are emulated by QEMU. The virtual label > storage area is not written to the host one. Instead, we can reserve a > piece area on pmem for the virtual one. > > Besides namespaces, we can also create DAX file systems on pmem and > use files to partition. > Forgot references: [1] NVDIMM DSM Interface Examples, http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf [2] NVDIMM Namespace Specification, http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf > Haozhong > > > > So if Xen is going to use PMEM, it will have to invent an entirely new > > > interface for guests, and it will have to keep track of those > > > resources across host reboots. In other words, it will have to > > > duplicate all the work that Linux already does. What do we gain from > > > that duplication? Why not just leverage what's already implemented in > > > dom0? > > > > Indeed if my guessing on the intentions was wrong, then the > > picture completely changes (also for the points you've made > > further down). > > > > Jan > >
>>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote: >> Last year at Linux Plumbers Conference I attended a session dedicated >> to NVDIMM support. I asked the very same question and the INTEL guy >> there told me there is indeed something like a partition table meant >> to describe the layout of the memory areas and their contents. > > It is described in details at pmem.io, look at Documents, see > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section. Well, that's about how PMEM and PBLK ranges get marked, but not about how use of the space inside a PMEM range is coordinated. Jan
>>> On 26.01.16 at 16:30, <haozhong.zhang@intel.com> wrote: > On 01/26/16 05:44, Jan Beulich wrote: >> Interesting. This isn't the usage model I have been thinking about >> so far. Having just gone back to the original 0/4 mail, I'm afraid >> we're really left guessing, and you guessed differently than I did. >> My understanding of the intentions of PMEM so far was that this >> is a high-capacity, slower than DRAM but much faster than e.g. >> swapping to disk alternative to normal RAM. I.e. the persistent >> aspect of it wouldn't matter at all in this case (other than for PBLK, >> obviously). > > Of course, pmem could be used in the way you thought because of its > 'ram' aspect. But I think the more meaningful usage is from its > persistent aspect. For example, the implementation of some journal > file systems could store logs in pmem rather than the normal ram, so > that if a power failure happens before those in-memory logs are > completely written to the disk, there would still be chance to restore > them from pmem after next booting (rather than abandoning all of > them). Well, that leaves open how that file system would find its log after reboot, or how that log is protected from clobbering by another OS booted in between. >> However, thinking through your usage model I have problems >> seeing it work in a reasonable way even with virtualization left >> aside: To my knowledge there's no established protocol on how >> multiple parties (different versions of the same OS, or even >> completely different OSes) would arbitrate using such memory >> ranges. And even for a single OS it is, other than for disks (and >> hence PBLK), not immediately clear how it would communicate >> from one boot to another what information got stored where, >> or how it would react to some or all of this storage having >> disappeared (just like a disk which got removed, which - unless >> it held the boot partition - would normally have pretty little >> effect on the OS coming back up). > > Label storage area is a persistent area on NVDIMM and can be used to > store partitions information. It's not included in pmem (that part > that is mapped into the system address space). Instead, it can be only > accessed through NVDIMM _DSM method [1]. However, what contents are > stored and how they are interpreted are left to software. One way is > to follow NVDIMM Namespace Specification [2] to store an array of > labels that describe the start address (from the base 0 of pmem) and > the size of each partition, which is called as namespace. On Linux, > each namespace is exposed as a /dev/pmemXX device. According to what I've just read in one of the documents Konrad pointed us to, there can be just one PMEM label per DIMM. Unless I misread of course... Jan
On 01/26/16 08:37, Jan Beulich wrote:
> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
> >> Last year at Linux Plumbers Conference I attended a session dedicated
> >> to NVDIMM support. I asked the very same question and the INTEL guy
> >> there told me there is indeed something like a partition table meant
> >> to describe the layout of the memory areas and their contents.
> >
> > It is described in details at pmem.io, look at Documents, see
> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.
>
> Well, that's about how PMEM and PBLK ranges get marked, but not
> about how use of the space inside a PMEM range is coordinated.
>

How an NVDIMM is partitioned into pmem and pblk is described by the ACPI
NFIT table. A namespace is to pmem what a partition table is to a disk.

Haozhong
>>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote: > On 01/26/16 08:37, Jan Beulich wrote: >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote: >> >> Last year at Linux Plumbers Conference I attended a session dedicated >> >> to NVDIMM support. I asked the very same question and the INTEL guy >> >> there told me there is indeed something like a partition table meant >> >> to describe the layout of the memory areas and their contents. >> > >> > It is described in details at pmem.io, look at Documents, see >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section. >> >> Well, that's about how PMEM and PBLK ranges get marked, but not >> about how use of the space inside a PMEM range is coordinated. >> > > How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT > table. > Namespace to pmem is something like partition table to disk. But I'm talking about sub-dividing the space inside an individual PMEM range. Jan
On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote: > >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote: > > On 01/26/16 08:37, Jan Beulich wrote: > >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote: > >> >> Last year at Linux Plumbers Conference I attended a session dedicated > >> >> to NVDIMM support. I asked the very same question and the INTEL guy > >> >> there told me there is indeed something like a partition table meant > >> >> to describe the layout of the memory areas and their contents. > >> > > >> > It is described in details at pmem.io, look at Documents, see > >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section. > >> > >> Well, that's about how PMEM and PBLK ranges get marked, but not > >> about how use of the space inside a PMEM range is coordinated. > >> > > > > How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT > > table. > > Namespace to pmem is something like partition table to disk. > > But I'm talking about sub-dividing the space inside an individual > PMEM range. The namespaces are it. Once you have done them you can mount the PMEM range under say /dev/pmem0 and then put a filesystem on it (ext4, xfs) - and enable DAX support. The DAX just means that the FS will bypass the page cache and write directly to the virtual address. then one can create giant 'dd' images on this filesystem and pass it to QEMU to .. expose as NVDIMM to the guest. Because it is a file - the blocks (or MFNs) for the contents of the file are most certainly discontingous. > > Jan >
On 01/26/16 08:57, Jan Beulich wrote: > >>> On 26.01.16 at 16:30, <haozhong.zhang@intel.com> wrote: > > On 01/26/16 05:44, Jan Beulich wrote: > >> Interesting. This isn't the usage model I have been thinking about > >> so far. Having just gone back to the original 0/4 mail, I'm afraid > >> we're really left guessing, and you guessed differently than I did. > >> My understanding of the intentions of PMEM so far was that this > >> is a high-capacity, slower than DRAM but much faster than e.g. > >> swapping to disk alternative to normal RAM. I.e. the persistent > >> aspect of it wouldn't matter at all in this case (other than for PBLK, > >> obviously). > > > > Of course, pmem could be used in the way you thought because of its > > 'ram' aspect. But I think the more meaningful usage is from its > > persistent aspect. For example, the implementation of some journal > > file systems could store logs in pmem rather than the normal ram, so > > that if a power failure happens before those in-memory logs are > > completely written to the disk, there would still be chance to restore > > them from pmem after next booting (rather than abandoning all of > > them). > > Well, that leaves open how that file system would find its log > after reboot, or how that log is protected from clobbering by > another OS booted in between. > It would depend on the concrete design of those OS or applications. This is just an example to show a possible usage of the persistent aspect. > >> However, thinking through your usage model I have problems > >> seeing it work in a reasonable way even with virtualization left > >> aside: To my knowledge there's no established protocol on how > >> multiple parties (different versions of the same OS, or even > >> completely different OSes) would arbitrate using such memory > >> ranges. And even for a single OS it is, other than for disks (and > >> hence PBLK), not immediately clear how it would communicate > >> from one boot to another what information got stored where, > >> or how it would react to some or all of this storage having > >> disappeared (just like a disk which got removed, which - unless > >> it held the boot partition - would normally have pretty little > >> effect on the OS coming back up). > > > > Label storage area is a persistent area on NVDIMM and can be used to > > store partitions information. It's not included in pmem (that part > > that is mapped into the system address space). Instead, it can be only > > accessed through NVDIMM _DSM method [1]. However, what contents are > > stored and how they are interpreted are left to software. One way is > > to follow NVDIMM Namespace Specification [2] to store an array of > > labels that describe the start address (from the base 0 of pmem) and > > the size of each partition, which is called as namespace. On Linux, > > each namespace is exposed as a /dev/pmemXX device. > > According to what I've just read in one of the documents Konrad > pointed us to, there can be just one PMEM label per DIMM. Unless > I misread of course... > My mistake, only one pmem label per DIMM. Haozhong
On 01/26/16 14:32, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote:
> > >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
> > > On 01/26/16 08:37, Jan Beulich wrote:
> > >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
> > >> >> Last year at Linux Plumbers Conference I attended a session dedicated
> > >> >> to NVDIMM support. I asked the very same question and the INTEL guy
> > >> >> there told me there is indeed something like a partition table meant
> > >> >> to describe the layout of the memory areas and their contents.
> > >> >
> > >> > It is described in details at pmem.io, look at Documents, see
> > >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.
> > >>
> > >> Well, that's about how PMEM and PBLK ranges get marked, but not
> > >> about how use of the space inside a PMEM range is coordinated.
> > >>
> > >
> > > How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT
> > > table.
> > > Namespace to pmem is something like partition table to disk.
> >
> > But I'm talking about sub-dividing the space inside an individual
> > PMEM range.
>
> The namespaces are it.
>

Because only one persistent memory namespace is allowed for an
individual pmem range, namespaces cannot be used to sub-divide it.

> Once you have done them you can mount the PMEM range under say /dev/pmem0
> and then put a filesystem on it (ext4, xfs) - and enable DAX support.
> The DAX just means that the FS will bypass the page cache and write directly
> to the virtual address.
>
> then one can create giant 'dd' images on this filesystem and pass it
> to QEMU to .. expose as NVDIMM to the guest. Because it is a file - the blocks
> (or MFNs) for the contents of the file are most certainly discontingous.
>

Though the 'dd' image may occupy discontiguous MFNs on host pmem, we
can map them to contiguous guest PFNs.
>>> On 26.01.16 at 20:32, <konrad.wilk@oracle.com> wrote: > On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote: >> >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote: >> > On 01/26/16 08:37, Jan Beulich wrote: >> >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote: >> >> >> Last year at Linux Plumbers Conference I attended a session dedicated >> >> >> to NVDIMM support. I asked the very same question and the INTEL guy >> >> >> there told me there is indeed something like a partition table meant >> >> >> to describe the layout of the memory areas and their contents. >> >> > >> >> > It is described in details at pmem.io, look at Documents, see >> >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section. >> >> >> >> Well, that's about how PMEM and PBLK ranges get marked, but not >> >> about how use of the space inside a PMEM range is coordinated. >> >> >> > >> > How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT >> > table. >> > Namespace to pmem is something like partition table to disk. >> >> But I'm talking about sub-dividing the space inside an individual >> PMEM range. > > The namespaces are it. > > Once you have done them you can mount the PMEM range under say /dev/pmem0 > and then put a filesystem on it (ext4, xfs) - and enable DAX support. > The DAX just means that the FS will bypass the page cache and write directly > to the virtual address. > > then one can create giant 'dd' images on this filesystem and pass it > to QEMU to .. expose as NVDIMM to the guest. Because it is a file - the > blocks > (or MFNs) for the contents of the file are most certainly discontingous. And what's the advantage of this over PBLK? I.e. why would one want to separate PMEM and PBLK ranges if everything gets used the same way anyway? Jan
On Tue, Jan 26, 2016 at 4:34 PM, Jan Beulich <JBeulich@suse.com> wrote: >>>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote: >> On 01/26/16 08:37, Jan Beulich wrote: >>> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote: >>> >> Last year at Linux Plumbers Conference I attended a session dedicated >>> >> to NVDIMM support. I asked the very same question and the INTEL guy >>> >> there told me there is indeed something like a partition table meant >>> >> to describe the layout of the memory areas and their contents. >>> > >>> > It is described in details at pmem.io, look at Documents, see >>> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section. >>> >>> Well, that's about how PMEM and PBLK ranges get marked, but not >>> about how use of the space inside a PMEM range is coordinated. >>> >> >> How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT >> table. >> Namespace to pmem is something like partition table to disk. > > But I'm talking about sub-dividing the space inside an individual > PMEM range. Well as long as at a high level full PMEM blocks can be allocated / marked to a single OS, then that OS can figure out if / how to further subdivide them (and store information about that subdivision). But in any case, since it seems from what Haozhong and Konrad say, that the point of this *is* in fact to take advantage of the persistence, then it seems like allowing Linux to solve the problem of how to subdivide PMEM blocks and just leveraging their solution would be better than trying to duplicate all that effort inside of Xen. -George
On Wed, Jan 27, 2016 at 03:16:59AM -0700, Jan Beulich wrote: > >>> On 26.01.16 at 20:32, <konrad.wilk@oracle.com> wrote: > > On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote: > >> >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote: > >> > On 01/26/16 08:37, Jan Beulich wrote: > >> >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote: > >> >> >> Last year at Linux Plumbers Conference I attended a session dedicated > >> >> >> to NVDIMM support. I asked the very same question and the INTEL guy > >> >> >> there told me there is indeed something like a partition table meant > >> >> >> to describe the layout of the memory areas and their contents. > >> >> > > >> >> > It is described in details at pmem.io, look at Documents, see > >> >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section. > >> >> > >> >> Well, that's about how PMEM and PBLK ranges get marked, but not > >> >> about how use of the space inside a PMEM range is coordinated. > >> >> > >> > > >> > How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT > >> > table. > >> > Namespace to pmem is something like partition table to disk. > >> > >> But I'm talking about sub-dividing the space inside an individual > >> PMEM range. > > > > The namespaces are it. > > > > Once you have done them you can mount the PMEM range under say /dev/pmem0 > > and then put a filesystem on it (ext4, xfs) - and enable DAX support. > > The DAX just means that the FS will bypass the page cache and write directly > > to the virtual address. > > > > then one can create giant 'dd' images on this filesystem and pass it > > to QEMU to .. expose as NVDIMM to the guest. Because it is a file - the > > blocks > > (or MFNs) for the contents of the file are most certainly discontingous. > > And what's the advantage of this over PBLK? I.e. why would one > want to separate PMEM and PBLK ranges if everything gets used > the same way anyway? Speed. PBLK emulates hardware - by having a sliding window of the DIMM. The OS can only write to a ring-buffer with the system address and the payload (64bytes I think?) - and the hardware (or firmware) picks it up and does the writes to NVDIMM. The only motivation behind this is to deal with errors. Normal PMEM writes do not report errors. As in if the media is busted - the hardware will engage its remap logic and write somewhere else - until all of its remap blocks have been exhausted. At that point writes (I presume, not sure) and reads will report an error - but via an #MCE. Part of this Xen design will be how to handle that :-) With an PBLK - I presume the hardware/firmware will read the block after it has written it - and if there are errors it will report it right away. Which means you can easily hook PBLK nicely in RAID setups right away. It will be slower than PMEM, but it does give you the normal error reporting. That is until the MCE#->OS->fs errors logic gets figured out. The MCE# logic code is being developed right now by Tony Luck on LKML - and the last I saw the MCE# has the system address - and the MCE code would tag the pages with some bit so that the applications would get a signal. > > Jan >
diff --git a/tools/firmware/hvmloader/acpi/build.c b/tools/firmware/hvmloader/acpi/build.c
index 503648c..72be3e0 100644
--- a/tools/firmware/hvmloader/acpi/build.c
+++ b/tools/firmware/hvmloader/acpi/build.c
@@ -292,8 +292,10 @@ static struct acpi_20_slit *construct_slit(void)
     return slit;
 }
 
-static int construct_passthrough_tables(unsigned long *table_ptrs,
-                                        int nr_tables)
+static int construct_passthrough_tables_common(unsigned long *table_ptrs,
+                                               int nr_tables,
+                                               const char *xs_acpi_pt_addr,
+                                               const char *xs_acpi_pt_length)
 {
     const char *s;
     uint8_t *acpi_pt_addr;
@@ -304,26 +306,28 @@ static int construct_passthrough_tables(unsigned long *table_ptrs,
     uint32_t total = 0;
     uint8_t *buffer;
 
-    s = xenstore_read(HVM_XS_ACPI_PT_ADDRESS, NULL);
+    s = xenstore_read(xs_acpi_pt_addr, NULL);
     if ( s == NULL )
-        return 0;
+        return 0;
 
     acpi_pt_addr = (uint8_t*)(uint32_t)strtoll(s, NULL, 0);
     if ( acpi_pt_addr == NULL )
         return 0;
 
-    s = xenstore_read(HVM_XS_ACPI_PT_LENGTH, NULL);
+    s = xenstore_read(xs_acpi_pt_length, NULL);
     if ( s == NULL )
         return 0;
 
     acpi_pt_length = (uint32_t)strtoll(s, NULL, 0);
 
     for ( nr_added = 0; nr_added < nr_max; nr_added++ )
-    {
+    {
         if ( (acpi_pt_length - total) < sizeof(struct acpi_header) )
             break;
 
         header = (struct acpi_header*)acpi_pt_addr;
+        set_checksum(header, offsetof(struct acpi_header, checksum),
+                     header->length);
 
         buffer = mem_alloc(header->length, 16);
         if ( buffer == NULL )
@@ -338,6 +342,21 @@ static int construct_passthrough_tables(unsigned long *table_ptrs,
     return nr_added;
 }
 
+static int construct_passthrough_tables(unsigned long *table_ptrs,
+                                        int nr_tables)
+{
+    return construct_passthrough_tables_common(table_ptrs, nr_tables,
+                                               HVM_XS_ACPI_PT_ADDRESS,
+                                               HVM_XS_ACPI_PT_LENGTH);
+}
+
+static int construct_dm_tables(unsigned long *table_ptrs, int nr_tables)
+{
+    return construct_passthrough_tables_common(table_ptrs, nr_tables,
+                                               HVM_XS_DM_ACPI_PT_ADDRESS,
+                                               HVM_XS_DM_ACPI_PT_LENGTH);
+}
+
 static int construct_secondary_tables(unsigned long *table_ptrs,
                                       struct acpi_info *info)
 {
@@ -454,6 +473,9 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
     /* Load any additional tables passed through. */
     nr_tables += construct_passthrough_tables(table_ptrs, nr_tables);
 
+    /* Load any additional tables from device model */
+    nr_tables += construct_dm_tables(table_ptrs, nr_tables);
+
     table_ptrs[nr_tables] = 0;
     return nr_tables;
 }
diff --git a/xen/include/public/hvm/hvm_xs_strings.h b/xen/include/public/hvm/hvm_xs_strings.h
index 146b0b0..4698495 100644
--- a/xen/include/public/hvm/hvm_xs_strings.h
+++ b/xen/include/public/hvm/hvm_xs_strings.h
@@ -41,6 +41,9 @@
 #define HVM_XS_ACPI_PT_ADDRESS         "hvmloader/acpi/address"
 #define HVM_XS_ACPI_PT_LENGTH          "hvmloader/acpi/length"
 
+#define HVM_XS_DM_ACPI_PT_ADDRESS      "hvmloader/dm-acpi/address"
+#define HVM_XS_DM_ACPI_PT_LENGTH       "hvmloader/dm-acpi/length"
+
 /* Any number of SMBIOS types can be passed through to an HVM guest using
  * the following xenstore values. The values specify the guest physical
  * address and length of a block of SMBIOS structures for hvmloader to use.
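What hvmloader expects at the address published by the device model is a
plain concatenation of complete ACPI tables, each starting with the standard
36-byte header: construct_passthrough_tables_common() uses header->length to
size the copy and to step to the next table, and set_checksum() recomputes
each table's checksum. Below is a hypothetical sketch of that header layout
and of the checksum rule; the struct and helper are stand-ins written for
illustration, not code taken from this patch or from hvmloader.

/* Illustrative only: the standard ACPI table header that each table in the
 * passed-through blob starts with, and the checksum rule it must satisfy.
 * Field names follow the ACPI spec; fixup_checksum() is a stand-in for
 * hvmloader's set_checksum(), not code from this patch. */
#include <stdint.h>

struct acpi_header {
    char     signature[4];
    uint32_t length;            /* whole table, header included */
    uint8_t  revision;
    uint8_t  checksum;          /* all bytes of the table must sum to 0 */
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    char     creator_id[4];
    uint32_t creator_revision;
};

/* Recompute 'checksum' so the byte-wise sum of the table is 0 (mod 256). */
static void fixup_checksum(struct acpi_header *h)
{
    uint8_t *p = (uint8_t *)h, sum = 0;
    uint32_t i;

    h->checksum = 0;
    for ( i = 0; i < h->length; i++ )
        sum += p[i];
    h->checksum = -sum;
}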
NVDIMM devices are detected and configured by software through
ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
patch extends the existing mechanism in hvmloader of loading passthrough
ACPI tables to load extra ACPI tables built by QEMU.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 tools/firmware/hvmloader/acpi/build.c   | 34 +++++++++++++++++++++++++++------
 xen/include/public/hvm/hvm_xs_strings.h |  3 +++
 2 files changed, 31 insertions(+), 6 deletions(-)
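On the producer side, the only contract this patch adds is the pair of
xenstore keys: the device model (or the toolstack on its behalf) writes the
guest-physical address and total length of the QEMU-built tables under
hvmloader/dm-acpi/ in the guest's xenstore directory, in any format the
strtoll() calls above accept. A hypothetical sketch of that side, using
libxenstore directly, is below; the function name, the path construction and
the assumption that the caller already knows the address and length are all
illustrative and not part of this series.

/* Hypothetical sketch of the producer side of the two new keys (the real
 * code belongs to the QEMU/toolstack series, not to this patch).  Assumes
 * the guest-physical address and total length of the QEMU-built tables are
 * already known to the caller. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <xenstore.h>

static int publish_dm_acpi(int domid, uint64_t gpa, uint32_t len)
{
    char path[80], val[32];
    struct xs_handle *xsh = xs_open(0);

    if ( !xsh )
        return -1;

    /* hvmloader reads these keys relative to the guest's xenstore home. */
    snprintf(path, sizeof(path),
             "/local/domain/%d/hvmloader/dm-acpi/address", domid);
    snprintf(val, sizeof(val), "%"PRIu64, gpa);
    xs_write(xsh, XBT_NULL, path, val, strlen(val));

    snprintf(path, sizeof(path),
             "/local/domain/%d/hvmloader/dm-acpi/length", domid);
    snprintf(val, sizeof(val), "%"PRIu32, len);
    xs_write(xsh, XBT_NULL, path, val, strlen(val));

    xs_close(xsh);
    return 0;
}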