[v7,0/3] spapr: nvdimm: Introduce spapr-nvdimm device

Message ID 164396252398.109112.13436924292537517470.stgit@ltczzess4.aus.stglabs.ibm.com (mailing list archive)

Message

Shivaprasad G Bhat Feb. 4, 2022, 8:15 a.m. UTC
If the nvdimm device's backend is not persistent memory, explicit IO
flushes are needed to ensure persistence.

On SPAPR, the issue is addressed by adding a new hcall that the guest
uses to request an explicit flush when the backend is not pmem.
The approach here is to convey via a device tree property whether the
hcall-based flush is required. Once the guest knows the device needs
explicit flushes, it makes the hcall as and when required.
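
A minimal sketch of how such a device tree property could be set when the
NVDIMM node is built (hedged: the property name follows the linked kernel
patch, and fdt, child_offset and backend are illustrative here, not the
exact spapr_nvdimm.c code):

    /* If the backend cannot guarantee persistence by itself, tell the
     * guest that an explicit H_SCM_FLUSH is required. */
    if (!object_property_get_bool(OBJECT(backend), "pmem", NULL)) {
        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
                         NULL, 0));
    }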

It was suggested to create a new device type to address the
explicit flush for such backends on PPC instead of extending the
generic nvdimm device with a new property. So, this series introduces
the spapr-nvdimm device. The new device inherits the nvdimm device,
with the new behaviour that if the backend has pmem=no, the
device tree property is set by default.
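
In QOM terms, "inherits the nvdimm device" amounts to roughly the following
type registration (a sketch with assumed type and struct names, not the
exact patch):

    static const TypeInfo spapr_nvdimm_info = {
        .name          = TYPE_SPAPR_NVDIMM,        /* "spapr-nvdimm" */
        .parent        = TYPE_NVDIMM,              /* reuse nvdimm behaviour */
        .class_init    = spapr_nvdimm_class_init,
        .instance_size = sizeof(SpaprNVDIMMDevice),
    };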

The demonstration below shows the map_sync behavior for non-pmem
backends.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

Here pmem0 is from spapr-nvdimm with the backend's pmem=on, and pmem1 is
from spapr-nvdimm with pmem=off, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

[root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When pmem=on
[root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when pmem=off
Failed to mmap  with Operation not supported
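
The referenced map_sync.c essentially boils down to an mmap() with MAP_SYNC.
Below is a hedged, self-contained sketch of that check (not the exact avocado
source; it assumes a glibc/kernel that exposes MAP_SYNC and
MAP_SHARED_VALIDATE). The kernel only accepts MAP_SYNC when it can guarantee
synchronous page-fault persistence, which is why the pmem=off case above
fails with "Operation not supported":

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDWR | O_CREAT, 0644);

        if (fd < 0 || ftruncate(fd, 4096) < 0) {
            perror("open/ftruncate");
            return 1;
        }
        /* Succeeds on a DAX mount backed by real pmem; otherwise the kernel
         * rejects MAP_SYNC and mmap() fails with EOPNOTSUPP. */
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED) {
            printf("Failed to mmap with %s\n", strerror(errno));
            return 1;
        }
        munmap(p, 4096);
        close(fd);
        return 0;
    }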

The first patch adds the realize/unrealize callbacks to the generic nvdimm
device for the new device's vmstate registration. The second patch implements
the hcall and adds the necessary vmstate fields to the spapr machine structure
for carrying the hcall status across save-restore. Since the hcall is
asynchronous in nature, the patch uses AIO utilities to offload the flush
(see the sketch below). The third patch introduces the spapr-nvdimm device
and adds the device tree property for the guest when spapr-nvdimm is used
with pmem=no on the backend. It also adds a new property, pmem-override
(suggestions for a better name are welcome), to spapr-nvdimm, which allows
forcing hcall-based flushes even on pmem-backed devices.
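
For the asynchronous offload in the second patch, the flush follows roughly
the pattern below: the hcall only queues the work, and the actual sync runs
in QEMU's thread pool so the vCPU is not blocked. This is a hedged sketch
with simplified, illustrative names (flush_worker_cb and the completion
callback here are not the exact spapr_nvdimm.c code):

    #include "qemu/osdep.h"
    #include "block/thread-pool.h"
    #include "sysemu/hostmem.h"
    #include "hw/ppc/spapr.h"

    /* Runs in a worker thread: make the backing store durable. */
    static int flush_worker_cb(void *opaque)
    {
        HostMemoryBackend *backend = opaque;
        int fd = memory_region_get_fd(&backend->mr);

        /* A plain file backend needs an fdatasync(); a pmem backend would
         * instead use pmem_persist() on the mapped range. */
        if (fd >= 0 && qemu_fdatasync(fd) < 0) {
            return H_HARDWARE;
        }
        return H_SUCCESS;
    }

    /* Runs back in the main loop: record the result against the flush
     * token so a later H_SCM_FLUSH poll (and migration) can see it. */
    static void flush_completion_cb(void *opaque, int hcall_ret)
    {
    }

    static void submit_flush(HostMemoryBackend *backend)
    {
        ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());

        thread_pool_submit_aio(pool, flush_worker_cb, backend,
                               flush_completion_cb, NULL);
    }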

The kernel changes to exploit this hcall are at
https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch

---
v6 - https://lists.gnu.org/archive/html/qemu-devel/2022-02/msg00322.html
Changes from v6:
      - Addressed comments from Daniel:
        fixed a typo;
        fetch the memory backend FD in flush_worker_cb(); updated the hcall
        return values in the comment descriptions
      - Updated the signatures.

v5 - https://lists.gnu.org/archive/html/qemu-devel/2021-07/msg01741.html
Changes from v5:
      - Addressed all comments from David
      - Moved the flush lists from the spapr machine into the spapr-nvdimm
        device state structures; all corresponding data structures adjusted
        accordingly.
      - New property pmem-override is added to the spapr-nvdimm device. The
        hcall flushes are allowed when pmem-override is set for the device.
      - The flush for pmem backend devices is made to use pmem_persist().
      - The vmstate structures are also made part of the device state instead
        of the global spapr state.
      - Passing the flush token to the destination during migration; I think
        it is better than deriving it from the outstanding ones.

v4 - https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg05982.html
Changes from v4:
      - Introduce spapr-nvdimm device with nvdimm device as the parent.
      - The new spapr-nvdimm has no new properties. As this is a new
        device and there are no migration-related dependencies to be
        taken care of, the device behaviour is made to set the device tree
        property and enable the hcall when the device type spapr-nvdimm is
        used with pmem=off
      - Fixed commit messages
      - Added checks to ensure the backend is actually a file and not memory
      - Addressed things pointed out by Eric

v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
Changes from v3:
      - Fixed the forward declaration coding guideline violations in the 1st patch.
      - Removed the code waiting for the flushes to complete during migration;
        instead, the flush worker is restarted on the destination QEMU in post_load.
      - Got rid of the randomization of the flush tokens, using a simple
        counter instead.
      - Got rid of the redundant flush state lock, relying on the BQL now.
      - Handling the memory-backend-ram usage
      - Changed the sync-dax semantics from on/off to 'unsafe', 'writeback' and
        'direct'. Added prevention code using 'writeback' on arm and x86_64.
      - Addressed all the miscellaneous comments.

v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
Changes from v2:
      - Using the thread pool based approach as suggested
      - Moved the async hcall handling code to spapr_nvdimm.c along
        with some simplifications
      - Added vmstate to preserve the hcall status during save-restore
        along with pre_save handler code to complete all ongoing flushes.
      - Added hw_compat magic for sync-dax 'on' on previous machines.
      - Miscellaneous minor fixes.

v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1:
      - Fixed a missed-out unlock
      - Using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating the token

Shivaprasad G Bhat (3):
      nvdimm: Add realize, unrealize callbacks to NVDIMMDevice class
      spapr: nvdimm: Implement H_SCM_FLUSH hcall
      spapr: nvdimm: Introduce spapr-nvdimm device


 hw/mem/nvdimm.c               |  16 ++
 hw/mem/pc-dimm.c              |   5 +
 hw/ppc/spapr.c                |   2 +
 hw/ppc/spapr_nvdimm.c         | 394 ++++++++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h       |   2 +
 include/hw/mem/pc-dimm.h      |   1 +
 include/hw/ppc/spapr.h        |   4 +-
 include/hw/ppc/spapr_nvdimm.h |   1 +
 8 files changed, 424 insertions(+), 1 deletion(-)

--
Signature

Comments

Daniel Henrique Barboza Feb. 7, 2022, 11:53 a.m. UTC | #1
On 2/4/22 05:15, Shivaprasad G Bhat wrote:
> [...]

I noted that we have only two nvdimm tests in QEMU, both in tests/qtest/bios-tables-test.c.
It would be a good future improvement to add some spapr-nvdimm tests there as well.


Thanks,


Daniel


Cédric Le Goater Feb. 18, 2022, 7:44 a.m. UTC | #2
On 2/4/22 09:15, Shivaprasad G Bhat wrote:
> [...]

Applied for ppc-7.0

Thanks,

C.