diff mbox

[RESEND,1/2] nvdimm: warn if the backend is not a DAX device

Message ID 20170526023213.18741-1-haozhong.zhang@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Haozhong Zhang May 26, 2017, 2:32 a.m. UTC
Applications in Linux guest that use device-dax never trigger flush
that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
device-dax, QEMU cannot guarantee the persistence of guest writes.
Before solving this flushing problem, QEMU should warn users if the
host backend is not device-dax.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com
---
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
Resend because the wrong maintainer email address was used.
---
 hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

Comments

Dan Williams May 26, 2017, 3:34 a.m. UTC | #1
On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> Applications in Linux guest that use device-dax never trigger flush
> that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> device-dax, QEMU cannot guarantee the persistence of guest writes.
> Before solving this flushing problem, QEMU should warn users if the
> host backend is not device-dax.

I think this needs to be stronger than a "warn" it needs to be
explicitly forbidden when it is known to be unsafe.

>
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com
> ---
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> Cc: Stefan Hajnoczi <stefanha@gmail.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
> Resend because the wrong maintainer email address was used.
> ---
>  hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
>
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index db896b0bb6..c7bb407f33 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -26,6 +26,7 @@
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
>  #include "hw/mem/nvdimm.h"
> +#include "qemu/error-report.h"
>
>  static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
>                                    void *opaque, Error **errp)
> @@ -78,12 +79,48 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
>      return &nvdimm->nvdimm_mr;
>  }
>
> +static void nvdimm_check_dax(HostMemoryBackend *hostmem)
> +{
> +    char *mem_path =
> +        object_property_get_str(OBJECT(hostmem), "mem-path", NULL);
> +    char *dev_name = NULL, *sysfs_path = NULL;
> +    bool is_dax = false;
> +
> +    if (!mem_path) {
> +        goto out;
> +    }
> +
> +    if (!g_str_has_prefix(mem_path, "/dev/dax")) {
> +        goto out;
> +    }

What if we're using a symlink to a device-dax instance? The should
explicitly be looking up the major / minor number of the device (after
following any symlinks) and verifying that it is device-dax by
checking /sys/dev/char/$major:$minor refers to a device-dax instance.
Haozhong Zhang May 26, 2017, 4:30 a.m. UTC | #2
On 05/25/17 20:34 -0700, Dan Williams wrote:
> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > Applications in Linux guest that use device-dax never trigger flush
> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> > device-dax, QEMU cannot guarantee the persistence of guest writes.
> > Before solving this flushing problem, QEMU should warn users if the
> > host backend is not device-dax.
> 
> I think this needs to be stronger than a "warn" it needs to be
> explicitly forbidden when it is known to be unsafe.
>

I understand your worry and am not object to your suggestion, but
forbidden will change the existing behavior that allows such unsafe
usage. Let's wait for other maintainers' comments.

> >
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com
> > ---
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Igor Mammedov <imammedo@redhat.com>
> > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > Cc: Stefan Hajnoczi <stefanha@gmail.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > ---
> > Resend because the wrong maintainer email address was used.
> > ---
> >  hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++
> >  1 file changed, 37 insertions(+)
> >
> > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> > index db896b0bb6..c7bb407f33 100644
> > --- a/hw/mem/nvdimm.c
> > +++ b/hw/mem/nvdimm.c
> > @@ -26,6 +26,7 @@
> >  #include "qapi/error.h"
> >  #include "qapi/visitor.h"
> >  #include "hw/mem/nvdimm.h"
> > +#include "qemu/error-report.h"
> >
> >  static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
> >                                    void *opaque, Error **errp)
> > @@ -78,12 +79,48 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
> >      return &nvdimm->nvdimm_mr;
> >  }
> >
> > +static void nvdimm_check_dax(HostMemoryBackend *hostmem)
> > +{
> > +    char *mem_path =
> > +        object_property_get_str(OBJECT(hostmem), "mem-path", NULL);
> > +    char *dev_name = NULL, *sysfs_path = NULL;
> > +    bool is_dax = false;
> > +
> > +    if (!mem_path) {
> > +        goto out;
> > +    }
> > +
> > +    if (!g_str_has_prefix(mem_path, "/dev/dax")) {
> > +        goto out;
> > +    }
> 
> What if we're using a symlink to a device-dax instance? The should
> explicitly be looking up the major / minor number of the device (after
> following any symlinks) and verifying that it is device-dax by
> checking /sys/dev/char/$major:$minor refers to a device-dax instance.
> 

I'll follow to your suggestion in the next version.

Thanks,
Haozhong
Dan Williams May 26, 2017, 2:28 p.m. UTC | #3
On Thu, May 25, 2017 at 9:30 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 05/25/17 20:34 -0700, Dan Williams wrote:
>> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > Applications in Linux guest that use device-dax never trigger flush
>> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
>> > device-dax, QEMU cannot guarantee the persistence of guest writes.
>> > Before solving this flushing problem, QEMU should warn users if the
>> > host backend is not device-dax.
>>
>> I think this needs to be stronger than a "warn" it needs to be
>> explicitly forbidden when it is known to be unsafe.
>>
>
> I understand your worry and am not object to your suggestion, but
> forbidden will change the existing behavior that allows such unsafe
> usage. Let's wait for other maintainers' comments.

Changing existing behavior is the point. Even if a qemu maintainer
thought it was acceptable to allow the dangerous behavior with a
warning I would continue to assert that we need to block it by
default. I'm not opposed to adding a new configuration option like
"allow dangerous pmem" to override the new default behavior, but I
think this patch is incorrect to just emit a message.
Daniel P. Berrangé May 26, 2017, 2:38 p.m. UTC | #4
On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote:
> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > Applications in Linux guest that use device-dax never trigger flush
> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> > device-dax, QEMU cannot guarantee the persistence of guest writes.
> > Before solving this flushing problem, QEMU should warn users if the
> > host backend is not device-dax.
> 
> I think this needs to be stronger than a "warn" it needs to be
> explicitly forbidden when it is known to be unsafe.

I think users should have the choice in what they want to do -
QEMU should not artifically block it.  There are plenty of things
in QEMU that are potentially unsafe in some usage scenarios, but
we just document how to use them in a safe manner & any caveats
that apply. Higher level applications above QEMU can then consider
how they want to apply a usage policy to meet the needs of their
usage scenario.

Having an emulated DAX device that doesn't guarantee persistence
is no different to having an emulated disk device that never flushes
to underlying host storage.

Regards,
Daniel
Dan Williams May 26, 2017, 3:25 p.m. UTC | #5
On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote:
>> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > Applications in Linux guest that use device-dax never trigger flush
>> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
>> > device-dax, QEMU cannot guarantee the persistence of guest writes.
>> > Before solving this flushing problem, QEMU should warn users if the
>> > host backend is not device-dax.
>>
>> I think this needs to be stronger than a "warn" it needs to be
>> explicitly forbidden when it is known to be unsafe.
>
> I think users should have the choice in what they want to do -
> QEMU should not artifically block it.  There are plenty of things
> in QEMU that are potentially unsafe in some usage scenarios, but
> we just document how to use them in a safe manner & any caveats
> that apply. Higher level applications above QEMU can then consider
> how they want to apply a usage policy to meet the needs of their
> usage scenario.
>
> Having an emulated DAX device that doesn't guarantee persistence
> is no different to having an emulated disk device that never flushes
> to underlying host storage.
>

It is different in the sense that the contract of when the guest
should assume persistence is specified by when the write completes to
the virtual disk. In the case of the virtual NFIT we are currently
lying to the guest about that platform persistence guarantee even if
the hypervisor is emulating pmem with volatile memory.

In other words, I agree that it should be possible to tell the guest
to assume it is pmem when it is not, but we need more granularity in
the configuration to communicate the capabilities correctly to the
guest. It seems the NFIT memory device state flag
ACPI_NFIT_MEM_NOT_ARMED is a good way to communicate pmem safety to
the guest. Can we add a new knob to control the polarity of that flag
and make ACPI_NFIT_MEM_NOT_ARMED being set the default case when using
regular file mmap and clear the flag by default in the device-dax
case?
Haozhong Zhang May 27, 2017, 1:13 a.m. UTC | #6
On 05/26/17 08:25 -0700, Dan Williams wrote:
> On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote:
> >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
> >> <haozhong.zhang@intel.com> wrote:
> >> > Applications in Linux guest that use device-dax never trigger flush
> >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> >> > device-dax, QEMU cannot guarantee the persistence of guest writes.
> >> > Before solving this flushing problem, QEMU should warn users if the
> >> > host backend is not device-dax.
> >>
> >> I think this needs to be stronger than a "warn" it needs to be
> >> explicitly forbidden when it is known to be unsafe.
> >
> > I think users should have the choice in what they want to do -
> > QEMU should not artifically block it.  There are plenty of things
> > in QEMU that are potentially unsafe in some usage scenarios, but
> > we just document how to use them in a safe manner & any caveats
> > that apply. Higher level applications above QEMU can then consider
> > how they want to apply a usage policy to meet the needs of their
> > usage scenario.
> >
> > Having an emulated DAX device that doesn't guarantee persistence
> > is no different to having an emulated disk device that never flushes
> > to underlying host storage.
> >
> 
> It is different in the sense that the contract of when the guest
> should assume persistence is specified by when the write completes to
> the virtual disk. In the case of the virtual NFIT we are currently
> lying to the guest about that platform persistence guarantee even if
> the hypervisor is emulating pmem with volatile memory.
> 
> In other words, I agree that it should be possible to tell the guest
> to assume it is pmem when it is not, but we need more granularity in
> the configuration to communicate the capabilities correctly to the
> guest. It seems the NFIT memory device state flag
> ACPI_NFIT_MEM_NOT_ARMED is a good way to communicate pmem safety to
> the guest. Can we add a new knob to control the polarity of that flag
> and make ACPI_NFIT_MEM_NOT_ARMED being set the default case when using
> regular file mmap and clear the flag by default in the device-dax
> case?

Yes, we can set ACPI_NFIT_MEM_NOT_ARMED if the host backend is not
device-dax.

Thanks,
Haozhong
Daniel P. Berrangé May 30, 2017, 9:20 a.m. UTC | #7
On Fri, May 26, 2017 at 08:25:20AM -0700, Dan Williams wrote:
> On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote:
> >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
> >> <haozhong.zhang@intel.com> wrote:
> >> > Applications in Linux guest that use device-dax never trigger flush
> >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> >> > device-dax, QEMU cannot guarantee the persistence of guest writes.
> >> > Before solving this flushing problem, QEMU should warn users if the
> >> > host backend is not device-dax.
> >>
> >> I think this needs to be stronger than a "warn" it needs to be
> >> explicitly forbidden when it is known to be unsafe.
> >
> > I think users should have the choice in what they want to do -
> > QEMU should not artifically block it.  There are plenty of things
> > in QEMU that are potentially unsafe in some usage scenarios, but
> > we just document how to use them in a safe manner & any caveats
> > that apply. Higher level applications above QEMU can then consider
> > how they want to apply a usage policy to meet the needs of their
> > usage scenario.
> >
> > Having an emulated DAX device that doesn't guarantee persistence
> > is no different to having an emulated disk device that never flushes
> > to underlying host storage.
> >
> 
> It is different in the sense that the contract of when the guest
> should assume persistence is specified by when the write completes to
> the virtual disk. In the case of the virtual NFIT we are currently
> lying to the guest about that platform persistence guarantee even if
> the hypervisor is emulating pmem with volatile memory.

We equally lie to the guest about persistence of disks, when the
disk is run with certain cache= settings, or when the disk backing file
is on tmpfs. It is simply a choice the mgmt application makes about
whether to provide backing storage that is capable of satsifying the
guarantees implied by the guest device model. So I'm still not seeing
a compelling reason to artifically block this with DAX.

Regards,
Daniel
Dan Williams May 30, 2017, 11:41 a.m. UTC | #8
On Tue, May 30, 2017 at 2:20 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> On Fri, May 26, 2017 at 08:25:20AM -0700, Dan Williams wrote:
>> On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
>> > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote:
>> >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
>> >> <haozhong.zhang@intel.com> wrote:
>> >> > Applications in Linux guest that use device-dax never trigger flush
>> >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
>> >> > device-dax, QEMU cannot guarantee the persistence of guest writes.
>> >> > Before solving this flushing problem, QEMU should warn users if the
>> >> > host backend is not device-dax.
>> >>
>> >> I think this needs to be stronger than a "warn" it needs to be
>> >> explicitly forbidden when it is known to be unsafe.
>> >
>> > I think users should have the choice in what they want to do -
>> > QEMU should not artifically block it.  There are plenty of things
>> > in QEMU that are potentially unsafe in some usage scenarios, but
>> > we just document how to use them in a safe manner & any caveats
>> > that apply. Higher level applications above QEMU can then consider
>> > how they want to apply a usage policy to meet the needs of their
>> > usage scenario.
>> >
>> > Having an emulated DAX device that doesn't guarantee persistence
>> > is no different to having an emulated disk device that never flushes
>> > to underlying host storage.
>> >
>>
>> It is different in the sense that the contract of when the guest
>> should assume persistence is specified by when the write completes to
>> the virtual disk. In the case of the virtual NFIT we are currently
>> lying to the guest about that platform persistence guarantee even if
>> the hypervisor is emulating pmem with volatile memory.
>
> We equally lie to the guest about persistence of disks, when the
> disk is run with certain cache= settings, or when the disk backing file
> is on tmpfs. It is simply a choice the mgmt application makes about
> whether to provide backing storage that is capable of satsifying the
> guarantees implied by the guest device model. So I'm still not seeing
> a compelling reason to artifically block this with DAX.

Agreed, artificially blocking is not a good path, but for completeness
we still need a way to control the ACPI "not armed" health state flag
and a sane default for that parameter.
Xiao Guangrong June 1, 2017, noon UTC | #9
On 05/26/2017 10:32 AM, Haozhong Zhang wrote:

> +static void nvdimm_check_dax(HostMemoryBackend *hostmem)
> +{
> +    char *mem_path =
> +        object_property_get_str(OBJECT(hostmem), "mem-path", NULL);
> +    char *dev_name = NULL, *sysfs_path = NULL;
> +    bool is_dax = false;
> +
> +    if (!mem_path) {
> +        goto out;
> +    }
> +
> +    if (!g_str_has_prefix(mem_path, "/dev/dax")) {
> +        goto out;
> +    }
> +
> +    dev_name = mem_path + strlen("/dev/");
> +    sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name);
> +    if (access(sysfs_path, F_OK)) {
> +        goto out;
> +    }
> +
> +    is_dax = true;
> +

So only dax raw disk has write-persistence guaranty, a pre-allocated
file which locates on a DAX-enabled filesystem can not?
Dan Williams June 1, 2017, 1:53 p.m. UTC | #10
[ adding linux-fsdevel ]

On Thu, Jun 1, 2017 at 5:00 AM, Xiao Guangrong <guangrong.xiao@gmail.com> wrote:
>
>
> On 05/26/2017 10:32 AM, Haozhong Zhang wrote:
>
>> +static void nvdimm_check_dax(HostMemoryBackend *hostmem)
>> +{
>> +    char *mem_path =
>> +        object_property_get_str(OBJECT(hostmem), "mem-path", NULL);
>> +    char *dev_name = NULL, *sysfs_path = NULL;
>> +    bool is_dax = false;
>> +
>> +    if (!mem_path) {
>> +        goto out;
>> +    }
>> +
>> +    if (!g_str_has_prefix(mem_path, "/dev/dax")) {
>> +        goto out;
>> +    }
>> +
>> +    dev_name = mem_path + strlen("/dev/");
>> +    sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name);
>> +    if (access(sysfs_path, F_OK)) {
>> +        goto out;
>> +    }
>> +
>> +    is_dax = true;
>> +
>
>
> So only dax raw disk has write-persistence guaranty, a pre-allocated
> file which locates on a DAX-enabled filesystem can not?

Correct, it is not guaranteed by any existing filesystem. It may work
by accident today on ext4 with a pre-allocated file, but a filesystem
is otherwise permitted to drop any writes that have not been synced.
Until we get an explicit MMAP_SYNC or METADATA_IMMUTABLE feature into
a filesystem the only interface that we can use to reliably pass
persistent memory through to a guest is device-dax.
diff mbox

Patch

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index db896b0bb6..c7bb407f33 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -26,6 +26,7 @@ 
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "hw/mem/nvdimm.h"
+#include "qemu/error-report.h"
 
 static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
                                   void *opaque, Error **errp)
@@ -78,12 +79,48 @@  static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
     return &nvdimm->nvdimm_mr;
 }
 
+static void nvdimm_check_dax(HostMemoryBackend *hostmem)
+{
+    char *mem_path =
+        object_property_get_str(OBJECT(hostmem), "mem-path", NULL);
+    char *dev_name = NULL, *sysfs_path = NULL;
+    bool is_dax = false;
+
+    if (!mem_path) {
+        goto out;
+    }
+
+    if (!g_str_has_prefix(mem_path, "/dev/dax")) {
+        goto out;
+    }
+
+    dev_name = mem_path + strlen("/dev/");
+    sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name);
+    if (access(sysfs_path, F_OK)) {
+        goto out;
+    }
+
+    is_dax = true;
+
+ out:
+    if (!is_dax) {
+        error_report("warning: nvdimm backend %s is not DAX device, "
+                     "unable to guarantee persistence of guest writes",
+                     mem_path ?: "RAM");
+    }
+
+    g_free(sysfs_path);
+    g_free(mem_path);
+}
+
 static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
 {
     MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem, errp);
     NVDIMMDevice *nvdimm = NVDIMM(dimm);
     uint64_t align, pmem_size, size = memory_region_size(mr);
 
+    nvdimm_check_dax(dimm->hostmem);
+
     align = memory_region_get_alignment(mr);
 
     pmem_size = size - nvdimm->label_size;