Message ID | 20170526023213.18741-1-haozhong.zhang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang <haozhong.zhang@intel.com> wrote: > Applications in Linux guest that use device-dax never trigger flush > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not > device-dax, QEMU cannot guarantee the persistence of guest writes. > Before solving this flushing problem, QEMU should warn users if the > host backend is not device-dax. I think this needs to be stronger than a "warn" it needs to be explicitly forbidden when it is known to be unsafe. > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> > Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com > --- > Cc: "Michael S. Tsirkin" <mst@redhat.com> > Cc: Igor Mammedov <imammedo@redhat.com> > Cc: Xiao Guangrong <guangrong.xiao@gmail.com> > Cc: Stefan Hajnoczi <stefanha@gmail.com> > Cc: Dan Williams <dan.j.williams@intel.com> > --- > Resend because the wrong maintainer email address was used. > --- > hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++ > 1 file changed, 37 insertions(+) > > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c > index db896b0bb6..c7bb407f33 100644 > --- a/hw/mem/nvdimm.c > +++ b/hw/mem/nvdimm.c > @@ -26,6 +26,7 @@ > #include "qapi/error.h" > #include "qapi/visitor.h" > #include "hw/mem/nvdimm.h" > +#include "qemu/error-report.h" > > static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name, > void *opaque, Error **errp) > @@ -78,12 +79,48 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm) > return &nvdimm->nvdimm_mr; > } > > +static void nvdimm_check_dax(HostMemoryBackend *hostmem) > +{ > + char *mem_path = > + object_property_get_str(OBJECT(hostmem), "mem-path", NULL); > + char *dev_name = NULL, *sysfs_path = NULL; > + bool is_dax = false; > + > + if (!mem_path) { > + goto out; > + } > + > + if (!g_str_has_prefix(mem_path, "/dev/dax")) { > + goto out; > + } What if we're using a symlink to a device-dax instance? The should explicitly be looking up the major / minor number of the device (after following any symlinks) and verifying that it is device-dax by checking /sys/dev/char/$major:$minor refers to a device-dax instance.
On 05/25/17 20:34 -0700, Dan Williams wrote: > On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang > <haozhong.zhang@intel.com> wrote: > > Applications in Linux guest that use device-dax never trigger flush > > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not > > device-dax, QEMU cannot guarantee the persistence of guest writes. > > Before solving this flushing problem, QEMU should warn users if the > > host backend is not device-dax. > > I think this needs to be stronger than a "warn" it needs to be > explicitly forbidden when it is known to be unsafe. > I understand your worry and am not object to your suggestion, but forbidden will change the existing behavior that allows such unsafe usage. Let's wait for other maintainers' comments. > > > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> > > Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com > > --- > > Cc: "Michael S. Tsirkin" <mst@redhat.com> > > Cc: Igor Mammedov <imammedo@redhat.com> > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com> > > Cc: Stefan Hajnoczi <stefanha@gmail.com> > > Cc: Dan Williams <dan.j.williams@intel.com> > > --- > > Resend because the wrong maintainer email address was used. > > --- > > hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++ > > 1 file changed, 37 insertions(+) > > > > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c > > index db896b0bb6..c7bb407f33 100644 > > --- a/hw/mem/nvdimm.c > > +++ b/hw/mem/nvdimm.c > > @@ -26,6 +26,7 @@ > > #include "qapi/error.h" > > #include "qapi/visitor.h" > > #include "hw/mem/nvdimm.h" > > +#include "qemu/error-report.h" > > > > static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name, > > void *opaque, Error **errp) > > @@ -78,12 +79,48 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm) > > return &nvdimm->nvdimm_mr; > > } > > > > +static void nvdimm_check_dax(HostMemoryBackend *hostmem) > > +{ > > + char *mem_path = > > + object_property_get_str(OBJECT(hostmem), "mem-path", NULL); > > + char *dev_name = NULL, *sysfs_path = NULL; > > + bool is_dax = false; > > + > > + if (!mem_path) { > > + goto out; > > + } > > + > > + if (!g_str_has_prefix(mem_path, "/dev/dax")) { > > + goto out; > > + } > > What if we're using a symlink to a device-dax instance? The should > explicitly be looking up the major / minor number of the device (after > following any symlinks) and verifying that it is device-dax by > checking /sys/dev/char/$major:$minor refers to a device-dax instance. > I'll follow to your suggestion in the next version. Thanks, Haozhong
On Thu, May 25, 2017 at 9:30 PM, Haozhong Zhang <haozhong.zhang@intel.com> wrote: > On 05/25/17 20:34 -0700, Dan Williams wrote: >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang >> <haozhong.zhang@intel.com> wrote: >> > Applications in Linux guest that use device-dax never trigger flush >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not >> > device-dax, QEMU cannot guarantee the persistence of guest writes. >> > Before solving this flushing problem, QEMU should warn users if the >> > host backend is not device-dax. >> >> I think this needs to be stronger than a "warn" it needs to be >> explicitly forbidden when it is known to be unsafe. >> > > I understand your worry and am not object to your suggestion, but > forbidden will change the existing behavior that allows such unsafe > usage. Let's wait for other maintainers' comments. Changing existing behavior is the point. Even if a qemu maintainer thought it was acceptable to allow the dangerous behavior with a warning I would continue to assert that we need to block it by default. I'm not opposed to adding a new configuration option like "allow dangerous pmem" to override the new default behavior, but I think this patch is incorrect to just emit a message.
On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote: > On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang > <haozhong.zhang@intel.com> wrote: > > Applications in Linux guest that use device-dax never trigger flush > > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not > > device-dax, QEMU cannot guarantee the persistence of guest writes. > > Before solving this flushing problem, QEMU should warn users if the > > host backend is not device-dax. > > I think this needs to be stronger than a "warn" it needs to be > explicitly forbidden when it is known to be unsafe. I think users should have the choice in what they want to do - QEMU should not artifically block it. There are plenty of things in QEMU that are potentially unsafe in some usage scenarios, but we just document how to use them in a safe manner & any caveats that apply. Higher level applications above QEMU can then consider how they want to apply a usage policy to meet the needs of their usage scenario. Having an emulated DAX device that doesn't guarantee persistence is no different to having an emulated disk device that never flushes to underlying host storage. Regards, Daniel
On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote: > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote: >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang >> <haozhong.zhang@intel.com> wrote: >> > Applications in Linux guest that use device-dax never trigger flush >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not >> > device-dax, QEMU cannot guarantee the persistence of guest writes. >> > Before solving this flushing problem, QEMU should warn users if the >> > host backend is not device-dax. >> >> I think this needs to be stronger than a "warn" it needs to be >> explicitly forbidden when it is known to be unsafe. > > I think users should have the choice in what they want to do - > QEMU should not artifically block it. There are plenty of things > in QEMU that are potentially unsafe in some usage scenarios, but > we just document how to use them in a safe manner & any caveats > that apply. Higher level applications above QEMU can then consider > how they want to apply a usage policy to meet the needs of their > usage scenario. > > Having an emulated DAX device that doesn't guarantee persistence > is no different to having an emulated disk device that never flushes > to underlying host storage. > It is different in the sense that the contract of when the guest should assume persistence is specified by when the write completes to the virtual disk. In the case of the virtual NFIT we are currently lying to the guest about that platform persistence guarantee even if the hypervisor is emulating pmem with volatile memory. In other words, I agree that it should be possible to tell the guest to assume it is pmem when it is not, but we need more granularity in the configuration to communicate the capabilities correctly to the guest. It seems the NFIT memory device state flag ACPI_NFIT_MEM_NOT_ARMED is a good way to communicate pmem safety to the guest. Can we add a new knob to control the polarity of that flag and make ACPI_NFIT_MEM_NOT_ARMED being set the default case when using regular file mmap and clear the flag by default in the device-dax case?
On 05/26/17 08:25 -0700, Dan Williams wrote: > On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote: > > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote: > >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang > >> <haozhong.zhang@intel.com> wrote: > >> > Applications in Linux guest that use device-dax never trigger flush > >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not > >> > device-dax, QEMU cannot guarantee the persistence of guest writes. > >> > Before solving this flushing problem, QEMU should warn users if the > >> > host backend is not device-dax. > >> > >> I think this needs to be stronger than a "warn" it needs to be > >> explicitly forbidden when it is known to be unsafe. > > > > I think users should have the choice in what they want to do - > > QEMU should not artifically block it. There are plenty of things > > in QEMU that are potentially unsafe in some usage scenarios, but > > we just document how to use them in a safe manner & any caveats > > that apply. Higher level applications above QEMU can then consider > > how they want to apply a usage policy to meet the needs of their > > usage scenario. > > > > Having an emulated DAX device that doesn't guarantee persistence > > is no different to having an emulated disk device that never flushes > > to underlying host storage. > > > > It is different in the sense that the contract of when the guest > should assume persistence is specified by when the write completes to > the virtual disk. In the case of the virtual NFIT we are currently > lying to the guest about that platform persistence guarantee even if > the hypervisor is emulating pmem with volatile memory. > > In other words, I agree that it should be possible to tell the guest > to assume it is pmem when it is not, but we need more granularity in > the configuration to communicate the capabilities correctly to the > guest. It seems the NFIT memory device state flag > ACPI_NFIT_MEM_NOT_ARMED is a good way to communicate pmem safety to > the guest. Can we add a new knob to control the polarity of that flag > and make ACPI_NFIT_MEM_NOT_ARMED being set the default case when using > regular file mmap and clear the flag by default in the device-dax > case? Yes, we can set ACPI_NFIT_MEM_NOT_ARMED if the host backend is not device-dax. Thanks, Haozhong
On Fri, May 26, 2017 at 08:25:20AM -0700, Dan Williams wrote: > On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote: > > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote: > >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang > >> <haozhong.zhang@intel.com> wrote: > >> > Applications in Linux guest that use device-dax never trigger flush > >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not > >> > device-dax, QEMU cannot guarantee the persistence of guest writes. > >> > Before solving this flushing problem, QEMU should warn users if the > >> > host backend is not device-dax. > >> > >> I think this needs to be stronger than a "warn" it needs to be > >> explicitly forbidden when it is known to be unsafe. > > > > I think users should have the choice in what they want to do - > > QEMU should not artifically block it. There are plenty of things > > in QEMU that are potentially unsafe in some usage scenarios, but > > we just document how to use them in a safe manner & any caveats > > that apply. Higher level applications above QEMU can then consider > > how they want to apply a usage policy to meet the needs of their > > usage scenario. > > > > Having an emulated DAX device that doesn't guarantee persistence > > is no different to having an emulated disk device that never flushes > > to underlying host storage. > > > > It is different in the sense that the contract of when the guest > should assume persistence is specified by when the write completes to > the virtual disk. In the case of the virtual NFIT we are currently > lying to the guest about that platform persistence guarantee even if > the hypervisor is emulating pmem with volatile memory. We equally lie to the guest about persistence of disks, when the disk is run with certain cache= settings, or when the disk backing file is on tmpfs. It is simply a choice the mgmt application makes about whether to provide backing storage that is capable of satsifying the guarantees implied by the guest device model. So I'm still not seeing a compelling reason to artifically block this with DAX. Regards, Daniel
On Tue, May 30, 2017 at 2:20 AM, Daniel P. Berrange <berrange@redhat.com> wrote: > On Fri, May 26, 2017 at 08:25:20AM -0700, Dan Williams wrote: >> On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote: >> > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote: >> >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang >> >> <haozhong.zhang@intel.com> wrote: >> >> > Applications in Linux guest that use device-dax never trigger flush >> >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not >> >> > device-dax, QEMU cannot guarantee the persistence of guest writes. >> >> > Before solving this flushing problem, QEMU should warn users if the >> >> > host backend is not device-dax. >> >> >> >> I think this needs to be stronger than a "warn" it needs to be >> >> explicitly forbidden when it is known to be unsafe. >> > >> > I think users should have the choice in what they want to do - >> > QEMU should not artifically block it. There are plenty of things >> > in QEMU that are potentially unsafe in some usage scenarios, but >> > we just document how to use them in a safe manner & any caveats >> > that apply. Higher level applications above QEMU can then consider >> > how they want to apply a usage policy to meet the needs of their >> > usage scenario. >> > >> > Having an emulated DAX device that doesn't guarantee persistence >> > is no different to having an emulated disk device that never flushes >> > to underlying host storage. >> > >> >> It is different in the sense that the contract of when the guest >> should assume persistence is specified by when the write completes to >> the virtual disk. In the case of the virtual NFIT we are currently >> lying to the guest about that platform persistence guarantee even if >> the hypervisor is emulating pmem with volatile memory. > > We equally lie to the guest about persistence of disks, when the > disk is run with certain cache= settings, or when the disk backing file > is on tmpfs. It is simply a choice the mgmt application makes about > whether to provide backing storage that is capable of satsifying the > guarantees implied by the guest device model. So I'm still not seeing > a compelling reason to artifically block this with DAX. Agreed, artificially blocking is not a good path, but for completeness we still need a way to control the ACPI "not armed" health state flag and a sane default for that parameter.
On 05/26/2017 10:32 AM, Haozhong Zhang wrote: > +static void nvdimm_check_dax(HostMemoryBackend *hostmem) > +{ > + char *mem_path = > + object_property_get_str(OBJECT(hostmem), "mem-path", NULL); > + char *dev_name = NULL, *sysfs_path = NULL; > + bool is_dax = false; > + > + if (!mem_path) { > + goto out; > + } > + > + if (!g_str_has_prefix(mem_path, "/dev/dax")) { > + goto out; > + } > + > + dev_name = mem_path + strlen("/dev/"); > + sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name); > + if (access(sysfs_path, F_OK)) { > + goto out; > + } > + > + is_dax = true; > + So only dax raw disk has write-persistence guaranty, a pre-allocated file which locates on a DAX-enabled filesystem can not?
[ adding linux-fsdevel ] On Thu, Jun 1, 2017 at 5:00 AM, Xiao Guangrong <guangrong.xiao@gmail.com> wrote: > > > On 05/26/2017 10:32 AM, Haozhong Zhang wrote: > >> +static void nvdimm_check_dax(HostMemoryBackend *hostmem) >> +{ >> + char *mem_path = >> + object_property_get_str(OBJECT(hostmem), "mem-path", NULL); >> + char *dev_name = NULL, *sysfs_path = NULL; >> + bool is_dax = false; >> + >> + if (!mem_path) { >> + goto out; >> + } >> + >> + if (!g_str_has_prefix(mem_path, "/dev/dax")) { >> + goto out; >> + } >> + >> + dev_name = mem_path + strlen("/dev/"); >> + sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name); >> + if (access(sysfs_path, F_OK)) { >> + goto out; >> + } >> + >> + is_dax = true; >> + > > > So only dax raw disk has write-persistence guaranty, a pre-allocated > file which locates on a DAX-enabled filesystem can not? Correct, it is not guaranteed by any existing filesystem. It may work by accident today on ext4 with a pre-allocated file, but a filesystem is otherwise permitted to drop any writes that have not been synced. Until we get an explicit MMAP_SYNC or METADATA_IMMUTABLE feature into a filesystem the only interface that we can use to reliably pass persistent memory through to a guest is device-dax.
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c index db896b0bb6..c7bb407f33 100644 --- a/hw/mem/nvdimm.c +++ b/hw/mem/nvdimm.c @@ -26,6 +26,7 @@ #include "qapi/error.h" #include "qapi/visitor.h" #include "hw/mem/nvdimm.h" +#include "qemu/error-report.h" static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name, void *opaque, Error **errp) @@ -78,12 +79,48 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm) return &nvdimm->nvdimm_mr; } +static void nvdimm_check_dax(HostMemoryBackend *hostmem) +{ + char *mem_path = + object_property_get_str(OBJECT(hostmem), "mem-path", NULL); + char *dev_name = NULL, *sysfs_path = NULL; + bool is_dax = false; + + if (!mem_path) { + goto out; + } + + if (!g_str_has_prefix(mem_path, "/dev/dax")) { + goto out; + } + + dev_name = mem_path + strlen("/dev/"); + sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name); + if (access(sysfs_path, F_OK)) { + goto out; + } + + is_dax = true; + + out: + if (!is_dax) { + error_report("warning: nvdimm backend %s is not DAX device, " + "unable to guarantee persistence of guest writes", + mem_path ?: "RAM"); + } + + g_free(sysfs_path); + g_free(mem_path); +} + static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp) { MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem, errp); NVDIMMDevice *nvdimm = NVDIMM(dimm); uint64_t align, pmem_size, size = memory_region_size(mr); + nvdimm_check_dax(dimm->hostmem); + align = memory_region_get_alignment(mr); pmem_size = size - nvdimm->label_size;
Applications in Linux guest that use device-dax never trigger flush that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not device-dax, QEMU cannot guarantee the persistence of guest writes. Before solving this flushing problem, QEMU should warn users if the host backend is not device-dax. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com --- Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Cc: Stefan Hajnoczi <stefanha@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> --- Resend because the wrong maintainer email address was used. --- hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+)