[1/2] migration: Prioritize RDMA in ram_save_target_page()

Message ID 20250218074345.638203-1-lizhijian@fujitsu.com (mailing list archive)
State New

Commit Message

Li Zhijian Feb. 18, 2025, 7:43 a.m. UTC
Address an error in RDMA-based migration by ensuring RDMA is prioritized
when saving pages in `ram_save_target_page()`.

Previously, a refactoring in commit bc38dc2f5f3 placed the RDMA protocol's
page-saving step after the other protocols. This led to migration failures
with unknown control messages and state loading errors:
destination:
(qemu) qemu-system-x86_64: Unknown control message QEMU FILE
qemu-system-x86_64: error while loading state section id 1(ram)
qemu-system-x86_64: load of migration failed: Operation not permitted
source:
(qemu) qemu-system-x86_64: RDMA is in an error state waiting migration to abort!
qemu-system-x86_64: failed to save SaveStateEntry with id(name): 1(ram): -1
qemu-system-x86_64: rdma migration: recv polling control error!
qemu-system-x86_64: warning: Early error. Sending error.
qemu-system-x86_64: warning: rdma migration: send polling control error

RDMA migration implements its own protocol to send pages to the
destination side, so hand over to RDMA first to prevent pages from being
saved by another protocol.

Fixes: bc38dc2f5f3 ("migration: refactor ram_save_target_page functions")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 migration/ram.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
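
The ordering problem described above can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions: `ToyConfig`, `SavePath`, and the two `pick_path_*` functions are hypothetical names and not QEMU code, and the zero-page and multifd conditions are reduced to plain booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the page-saving paths; not the real QEMU API. */
typedef enum { PATH_RDMA, PATH_ZERO_PAGE, PATH_MULTIFD, PATH_PLAIN } SavePath;

typedef struct {
    bool rdma;          /* models migrate_rdma() */
    bool multifd;       /* models migrate_multifd() */
    bool page_is_zero;  /* models save_zero_page() finding a zero page */
} ToyConfig;

/* Simplified order after the refactoring: RDMA is consulted last. */
static SavePath pick_path_rdma_late(const ToyConfig *c)
{
    if (!c->multifd && c->page_is_zero) {
        return PATH_ZERO_PAGE;   /* page leaves through a non-RDMA path */
    }
    if (c->multifd) {
        return PATH_MULTIFD;
    }
    if (c->rdma) {
        return PATH_RDMA;
    }
    return PATH_PLAIN;
}

/* Order proposed by this patch: RDMA gets the page before anything else. */
static SavePath pick_path_rdma_first(const ToyConfig *c)
{
    if (c->rdma) {
        return PATH_RDMA;
    }
    if (!c->multifd && c->page_is_zero) {
        return PATH_ZERO_PAGE;
    }
    if (c->multifd) {
        return PATH_MULTIFD;
    }
    return PATH_PLAIN;
}
```

In this model, a zero page during an RDMA migration is picked up by the zero-page path when RDMA is checked last, and by the RDMA path when it is checked first.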

Comments

Fabiano Rosas Feb. 18, 2025, 8:30 p.m. UTC | #1
Li Zhijian via <qemu-devel@nongnu.org> writes:

> Address an error in RDMA-based migration by ensuring RDMA is prioritized
> when saving pages in `ram_save_target_page()`.
>
> Previously, the RDMA protocol's page-saving step was placed after other
> protocols due to a refactoring in commit bc38dc2f5f3. This led to migration
> failures characterized by unknown control messages and state loading errors
> destination:
> (qemu) qemu-system-x86_64: Unknown control message QEMU FILE
> qemu-system-x86_64: error while loading state section id 1(ram)
> qemu-system-x86_64: load of migration failed: Operation not permitted
> source:
> (qemu) qemu-system-x86_64: RDMA is in an error state waiting migration to abort!
> qemu-system-x86_64: failed to save SaveStateEntry with id(name): 1(ram): -1
> qemu-system-x86_64: rdma migration: recv polling control error!
> qemu-system-x86_64: warning: Early error. Sending error.
> qemu-system-x86_64: warning: rdma migration: send polling control error
>
> RDMA migration implemented its own protocol/method to send pages to
> destination side, hand over to RDMA first to prevent pages being saved by
> other protocol.
>
> Fixes: bc38dc2f5f3 ("migration: refactor ram_save_target_page functions")
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>  migration/ram.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 6f460fd22d2..635a2fe443a 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1964,6 +1964,11 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
>      ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
>      int res;
>  
> +    /* Hand over to RDMA first */
> +    if (control_save_page(pss, offset, &res)) {
> +        return res;
> +    }
> +

Can we hoist that migrate_rdma() check out of the function, since the
other paths already check before calling their functions?

>      if (!migrate_multifd()
>          || migrate_zero_page_detection() == ZERO_PAGE_DETECTION_LEGACY) {
>          if (save_zero_page(rs, pss, offset)) {
> @@ -1976,10 +1981,6 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
>          return ram_save_multifd_page(block, offset);
>      }
>  
> -    if (control_save_page(pss, offset, &res)) {
> -        return res;
> -    }
> -
>      return ram_save_page(rs, pss);
>  }
Peter Xu Feb. 18, 2025, 10:03 p.m. UTC | #2
On Tue, Feb 18, 2025 at 05:30:40PM -0300, Fabiano Rosas wrote:
> Li Zhijian via <qemu-devel@nongnu.org> writes:
> 
> > Address an error in RDMA-based migration by ensuring RDMA is prioritized
> > when saving pages in `ram_save_target_page()`.
> >
> > Previously, the RDMA protocol's page-saving step was placed after other
> > protocols due to a refactoring in commit bc38dc2f5f3. This led to migration
> > failures characterized by unknown control messages and state loading errors
> > destination:
> > (qemu) qemu-system-x86_64: Unknown control message QEMU FILE
> > qemu-system-x86_64: error while loading state section id 1(ram)
> > qemu-system-x86_64: load of migration failed: Operation not permitted
> > source:
> > (qemu) qemu-system-x86_64: RDMA is in an error state waiting migration to abort!
> > qemu-system-x86_64: failed to save SaveStateEntry with id(name): 1(ram): -1
> > qemu-system-x86_64: rdma migration: recv polling control error!
> > qemu-system-x86_64: warning: Early error. Sending error.
> > qemu-system-x86_64: warning: rdma migration: send polling control error
> >
> > RDMA migration implemented its own protocol/method to send pages to
> > destination side, hand over to RDMA first to prevent pages being saved by
> > other protocol.
> >
> > Fixes: bc38dc2f5f3 ("migration: refactor ram_save_target_page functions")
> > Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> > ---
> >  migration/ram.c | 9 +++++----
> >  1 file changed, 5 insertions(+), 4 deletions(-)
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 6f460fd22d2..635a2fe443a 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1964,6 +1964,11 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
> >      ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
> >      int res;
> >  
> > +    /* Hand over to RDMA first */
> > +    if (control_save_page(pss, offset, &res)) {
> > +        return res;
> > +    }
> > +
> 
> Can we hoist that migrate_rdma() from inside the function? Since the
> other paths already check first before calling their functions.

If we're talking about hoist and stuff.. and if we want to go slightly
further, I wonder if we could also drop RAM_SAVE_CONTROL_NOT_SUPP.

    if (!migrate_rdma() || migration_in_postcopy()) {
        return RAM_SAVE_CONTROL_NOT_SUPP;
    }

We should make sure rdma_control_save_page() won't get invoked at all in
either case above..  For postcopy, maybe we could fail in the QMP migrate /
migrate_incoming cmd, at migration_channels_and_transport_compatible().

> 
> >      if (!migrate_multifd()
> >          || migrate_zero_page_detection() == ZERO_PAGE_DETECTION_LEGACY) {
> >          if (save_zero_page(rs, pss, offset)) {
> > @@ -1976,10 +1981,6 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
> >          return ram_save_multifd_page(block, offset);
> >      }
> >  
> > -    if (control_save_page(pss, offset, &res)) {
> > -        return res;
> > -    }
> > -
> >      return ram_save_page(rs, pss);
> >  }
>
Zhijian Li (Fujitsu) Feb. 19, 2025, 9:39 a.m. UTC | #3
On 19/02/2025 06:03, Peter Xu wrote:
> On Tue, Feb 18, 2025 at 05:30:40PM -0300, Fabiano Rosas wrote:
>> Li Zhijian via <qemu-devel@nongnu.org> writes:
>>
>>> Address an error in RDMA-based migration by ensuring RDMA is prioritized
>>> when saving pages in `ram_save_target_page()`.
>>>
>>> Previously, the RDMA protocol's page-saving step was placed after other
>>> protocols due to a refactoring in commit bc38dc2f5f3. This led to migration
>>> failures characterized by unknown control messages and state loading errors
>>> destination:
>>> (qemu) qemu-system-x86_64: Unknown control message QEMU FILE
>>> qemu-system-x86_64: error while loading state section id 1(ram)
>>> qemu-system-x86_64: load of migration failed: Operation not permitted
>>> source:
>>> (qemu) qemu-system-x86_64: RDMA is in an error state waiting migration to abort!
>>> qemu-system-x86_64: failed to save SaveStateEntry with id(name): 1(ram): -1
>>> qemu-system-x86_64: rdma migration: recv polling control error!
>>> qemu-system-x86_64: warning: Early error. Sending error.
>>> qemu-system-x86_64: warning: rdma migration: send polling control error
>>>
>>> RDMA migration implemented its own protocol/method to send pages to
>>> destination side, hand over to RDMA first to prevent pages being saved by
>>> other protocol.
>>>
>>> Fixes: bc38dc2f5f3 ("migration: refactor ram_save_target_page functions")
>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>>> ---
>>>   migration/ram.c | 9 +++++----
>>>   1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index 6f460fd22d2..635a2fe443a 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -1964,6 +1964,11 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
>>>       ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
>>>       int res;
>>>   
>>> +    /* Hand over to RDMA first */
>>> +    if (control_save_page(pss, offset, &res)) {
>>> +        return res;
>>> +    }
>>> +
>>
>> Can we hoist that migrate_rdma() from inside the function? Since the
>> other paths already check first before calling their functions.
> 

Yeah, it sounds good to me.


> If we're talking about hoist and stuff.. and if we want to go slightly
> further, I wonder if we could also drop RAM_SAVE_CONTROL_NOT_SUPP.
> 
>      if (!migrate_rdma() || migration_in_postcopy()) {
>          return RAM_SAVE_CONTROL_NOT_SUPP;
>      }
> 
> We should make sure rdma_control_save_page() won't get invoked at all in
> either case above..  

> For postcopy, maybe we could fail in the QMP migrate /
> migrate_incoming cmd, at migration_channels_and_transport_compatible()

I tried to kill RAM_SAVE_CONTROL_NOT_SUPP, but it seems that doing so doesn't require touching any postcopy logic
"in the QMP migrate / migrate_incoming cmd, at migration_channels_and_transport_compatible()"

Is there something I might have overlooked?

A full draft diff would look like the one below.
It consists of 3 parts:

migration/rdma: Remove unnecessary RAM_SAVE_CONTROL_NOT_SUPP check in rdma_control_save_page()
migration: kill RAM_SAVE_CONTROL_NOT_SUPP
migration: open control_save_page() to ram_save_target_page()

diff --git a/migration/ram.c b/migration/ram.c
index 589b6505eb2..fc6a964fd64 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1143,32 +1143,6 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
      return len;
  }
  
-/*
- * @pages: the number of pages written by the control path,
- *        < 0 - error
- *        > 0 - number of pages written
- *
- * Return true if the pages has been saved, otherwise false is returned.
- */
-static bool control_save_page(PageSearchStatus *pss,
-                              ram_addr_t offset, int *pages)
-{
-    int ret;
-
-    ret = rdma_control_save_page(pss->pss_channel, pss->block->offset, offset,
-                                 TARGET_PAGE_SIZE);
-    if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
-        return false;
-    }
-
-    if (ret == RAM_SAVE_CONTROL_DELAYED) {
-        *pages = 1;
-        return true;
-    }
-    *pages = ret;
-    return true;
-}
-
  /*
   * directly send the page to the stream
   *
@@ -1964,6 +1938,16 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
      ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
      int res;
  
+    if (migrate_rdma() && !migration_in_postcopy()) {
+        res = rdma_control_save_page(pss->pss_channel, pss->block->offset,
+                                     offset, TARGET_PAGE_SIZE);
+
+        if (res == RAM_SAVE_CONTROL_DELAYED) {
+            res = 1;
+        }
+        return res;
+    }
+
      if (!migrate_multifd()
          || migrate_zero_page_detection() == ZERO_PAGE_DETECTION_LEGACY) {
          if (save_zero_page(rs, pss, offset)) {
@@ -1976,10 +1960,6 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
          return ram_save_multifd_page(block, offset);
      }
      }
  
-    if (control_save_page(pss, offset, &res)) {
-        return res;
-    }
-
      return ram_save_page(rs, pss);
  }
  
diff --git a/migration/rdma.c b/migration/rdma.c
index 76fb0349238..c6876347e1e 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -3284,14 +3284,11 @@ err:
  int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
                             ram_addr_t offset, size_t size)
  {
-    if (!migrate_rdma() || migration_in_postcopy()) {
-        return RAM_SAVE_CONTROL_NOT_SUPP;
-    }
+    assert(migrate_rdma());
  
      int ret = qemu_rdma_save_page(f, block_offset, offset, size);
  
-    if (ret != RAM_SAVE_CONTROL_DELAYED &&
-        ret != RAM_SAVE_CONTROL_NOT_SUPP) {
+    if (ret != RAM_SAVE_CONTROL_DELAYED) {
          if (ret < 0) {
              qemu_file_set_error(f, ret);
          }
diff --git a/migration/rdma.h b/migration/rdma.h
index f55f28bbed1..bb0296c3726 100644
--- a/migration/rdma.h
+++ b/migration/rdma.h
@@ -33,7 +33,6 @@ void rdma_start_incoming_migration(InetSocketAddress *host_port, Error **errp);
  #define RAM_CONTROL_ROUND     1
  #define RAM_CONTROL_FINISH    3
  
-#define RAM_SAVE_CONTROL_NOT_SUPP -1000
  #define RAM_SAVE_CONTROL_DELAYED  -2000
  
  #ifdef CONFIG_RDMA
@@ -56,7 +55,9 @@ static inline
  int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
                             ram_addr_t offset, size_t size)
  {
-    return RAM_SAVE_CONTROL_NOT_SUPP;
+    /* never reach */
+    assert(0);
+    return -1;
  }
  #endif
  #endif




Thanks
Zhijian

> 
>>
>>>       if (!migrate_multifd()
>>>           || migrate_zero_page_detection() == ZERO_PAGE_DETECTION_LEGACY) {
>>>           if (save_zero_page(rs, pss, offset)) {
>>> @@ -1976,10 +1981,6 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
>>>           return ram_save_multifd_page(block, offset);
>>>       }
>>>   
>>> -    if (control_save_page(pss, offset, &res)) {
>>> -        return res;
>>> -    }
>>> -
>>>       return ram_save_page(rs, pss);
>>>   }
>>
>
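
The shape of the draft above (check hoisted into the caller, assertion in the callee, "delayed" mapped to one page) can be modelled in isolation. The following is a minimal sketch, assuming hypothetical names (`toy_rdma_save_page`, `toy_save_target_page`, `TOY_SAVE_CONTROL_DELAYED`) that only mirror the roles of the real functions; the RDMA backend result is passed in as a parameter instead of being produced by real I/O.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the role of RAM_SAVE_CONTROL_DELAYED; the name is hypothetical. */
#define TOY_SAVE_CONTROL_DELAYED (-2000)

/*
 * Hypothetical stand-in for rdma_control_save_page(). With the check
 * hoisted into the caller, the callee may simply assert that RDMA is on.
 */
static int toy_rdma_save_page(bool rdma_enabled, int backend_result)
{
    assert(rdma_enabled);
    return backend_result;
}

/*
 * Returns true when the RDMA path handled the page. In that case *pages
 * holds the number of pages written, or a negative error code.
 * Returns false (leaving *pages untouched) when the caller should fall
 * through to the other page-saving paths.
 */
static bool toy_save_target_page(bool rdma_enabled, int backend_result,
                                 int *pages)
{
    int res;

    if (!rdma_enabled) {
        return false;
    }

    res = toy_rdma_save_page(true, backend_result);
    /* A delayed write still accounts for exactly one page. */
    *pages = (res == TOY_SAVE_CONTROL_DELAYED) ? 1 : res;
    return true;
}
```

With the caller owning the check, a "not supported" return value has no remaining purpose in this model, which is the point of dropping RAM_SAVE_CONTROL_NOT_SUPP.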
Peter Xu Feb. 19, 2025, 1:23 p.m. UTC | #4
On Wed, Feb 19, 2025 at 09:39:38AM +0000, Zhijian Li (Fujitsu) wrote:
> 
> 
> On 19/02/2025 06:03, Peter Xu wrote:
> > On Tue, Feb 18, 2025 at 05:30:40PM -0300, Fabiano Rosas wrote:
> >> Li Zhijian via <qemu-devel@nongnu.org> writes:
> >>
> >>> Address an error in RDMA-based migration by ensuring RDMA is prioritized
> >>> when saving pages in `ram_save_target_page()`.
> >>>
> >>> Previously, the RDMA protocol's page-saving step was placed after other
> >>> protocols due to a refactoring in commit bc38dc2f5f3. This led to migration
> >>> failures characterized by unknown control messages and state loading errors
> >>> destination:
> >>> (qemu) qemu-system-x86_64: Unknown control message QEMU FILE
> >>> qemu-system-x86_64: error while loading state section id 1(ram)
> >>> qemu-system-x86_64: load of migration failed: Operation not permitted
> >>> source:
> >>> (qemu) qemu-system-x86_64: RDMA is in an error state waiting migration to abort!
> >>> qemu-system-x86_64: failed to save SaveStateEntry with id(name): 1(ram): -1
> >>> qemu-system-x86_64: rdma migration: recv polling control error!
> >>> qemu-system-x86_64: warning: Early error. Sending error.
> >>> qemu-system-x86_64: warning: rdma migration: send polling control error
> >>>
> >>> RDMA migration implemented its own protocol/method to send pages to
> >>> destination side, hand over to RDMA first to prevent pages being saved by
> >>> other protocol.
> >>>
> >>> Fixes: bc38dc2f5f3 ("migration: refactor ram_save_target_page functions")
> >>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> >>> ---
> >>>   migration/ram.c | 9 +++++----
> >>>   1 file changed, 5 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/migration/ram.c b/migration/ram.c
> >>> index 6f460fd22d2..635a2fe443a 100644
> >>> --- a/migration/ram.c
> >>> +++ b/migration/ram.c
> >>> @@ -1964,6 +1964,11 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
> >>>       ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
> >>>       int res;
> >>>   
> >>> +    /* Hand over to RDMA first */
> >>> +    if (control_save_page(pss, offset, &res)) {
> >>> +        return res;
> >>> +    }
> >>> +
> >>
> >> Can we hoist that migrate_rdma() from inside the function? Since the
> >> other paths already check first before calling their functions.
> > 
> 
> Yeah, it sounds good to me.
> 
> 
> > If we're talking about hoist and stuff.. and if we want to go slightly
> > further, I wonder if we could also drop RAM_SAVE_CONTROL_NOT_SUPP.
> > 
> >      if (!migrate_rdma() || migration_in_postcopy()) {
> >          return RAM_SAVE_CONTROL_NOT_SUPP;
> >      }
> > 
> > We should make sure rdma_control_save_page() won't get invoked at all in
> > either case above..  
> 
> > For postcopy, maybe we could fail in the QMP migrate /
> > migrate_incoming cmd, at migration_channels_and_transport_compatible()
> 
> I tried to kill RAM_SAVE_CONTROL_NOT_SUPP, but It seems it doesn't need to touch any postcopy logic
> "in the QMP migrate / migrate_incoming cmd, at migration_channels_and_transport_compatible()"
> 
> Is there something I might have overlooked?

Yes it looks almost good.  What I meant is (please see below):

> 
> A whole draft diff would be like below:
> It includes 3 parts:
> 
> migration/rdma: Remove unnecessary RAM_SAVE_CONTROL_NOT_SUPP check in rdma_control_save_page()
> migration: kill RAM_SAVE_CONTROL_NOT_SUPP
> migration: open control_save_page() to ram_save_target_page()
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 589b6505eb2..fc6a964fd64 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1143,32 +1143,6 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
>       return len;
>   }
>   
> -/*
> - * @pages: the number of pages written by the control path,
> - *        < 0 - error
> - *        > 0 - number of pages written
> - *
> - * Return true if the pages has been saved, otherwise false is returned.
> - */
> -static bool control_save_page(PageSearchStatus *pss,
> -                              ram_addr_t offset, int *pages)
> -{
> -    int ret;
> -
> -    ret = rdma_control_save_page(pss->pss_channel, pss->block->offset, offset,
> -                                 TARGET_PAGE_SIZE);
> -    if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
> -        return false;
> -    }
> -
> -    if (ret == RAM_SAVE_CONTROL_DELAYED) {
> -        *pages = 1;
> -        return true;
> -    }
> -    *pages = ret;
> -    return true;
> -}
> -
>   /*
>    * directly send the page to the stream
>    *
> @@ -1964,6 +1938,16 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
>       ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
>       int res;
>   
> +    if (migrate_rdma() && !migration_in_postcopy()) {

Here, instead of bypassing postcopy, we should fail the migrate cmd early if
postcopy is ever enabled:

diff --git a/migration/migration.c b/migration/migration.c
index 862f469ea7..3a82e71437 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -257,6 +257,12 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
         return false;
     }
 
+    if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE &&
+        migrate_postcopy_ram()) {
+        error_setg(errp, "RDMA migration doesn't support postcopy");
+        return false;
+    }
+
     return true;
 }

> +        res = rdma_control_save_page(pss->pss_channel, pss->block->offset,
> +                                     offset, TARGET_PAGE_SIZE);
> +
> +        if (res == RAM_SAVE_CONTROL_DELAYED) {
> +            res = 1;
> +        }
> +        return res;
> +    }
> +
>       if (!migrate_multifd()
>           || migrate_zero_page_detection() == ZERO_PAGE_DETECTION_LEGACY) {
>           if (save_zero_page(rs, pss, offset)) {
> @@ -1976,10 +1960,6 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
>           return ram_save_multifd_page(block, offset);
>       }
>       }
>   
> -    if (control_save_page(pss, offset, &res)) {
> -        return res;
> -    }
> -
>       return ram_save_page(rs, pss);
>   }
>   
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 76fb0349238..c6876347e1e 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3284,14 +3284,11 @@ err:
>   int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
>                              ram_addr_t offset, size_t size)
>   {
> -    if (!migrate_rdma() || migration_in_postcopy()) {
> -        return RAM_SAVE_CONTROL_NOT_SUPP;
> -    }
> +    assert(migrate_rdma());
>   
>       int ret = qemu_rdma_save_page(f, block_offset, offset, size);
>   
> -    if (ret != RAM_SAVE_CONTROL_DELAYED &&
> -        ret != RAM_SAVE_CONTROL_NOT_SUPP) {
> +    if (ret != RAM_SAVE_CONTROL_DELAYED) {
>           if (ret < 0) {
>               qemu_file_set_error(f, ret);
>           }
> diff --git a/migration/rdma.h b/migration/rdma.h
> index f55f28bbed1..bb0296c3726 100644
> --- a/migration/rdma.h
> +++ b/migration/rdma.h
> @@ -33,7 +33,6 @@ void rdma_start_incoming_migration(InetSocketAddress *host_port, Error **errp);
>   #define RAM_CONTROL_ROUND     1
>   #define RAM_CONTROL_FINISH    3
>   
> -#define RAM_SAVE_CONTROL_NOT_SUPP -1000
>   #define RAM_SAVE_CONTROL_DELAYED  -2000
>   
>   #ifdef CONFIG_RDMA
> @@ -56,7 +55,9 @@ static inline
>   int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
>                              ram_addr_t offset, size_t size)
>   {
> -    return RAM_SAVE_CONTROL_NOT_SUPP;
> +    /* never reach */
> +    assert(0);
> +    return -1;
>   }
>   #endif
>   #endif
> 
> 
> 
> 
> Thanks
> Zhijian
> 
> > 
> >>
> >>>       if (!migrate_multifd()
> >>>           || migrate_zero_page_detection() == ZERO_PAGE_DETECTION_LEGACY) {
> >>>           if (save_zero_page(rs, pss, offset)) {
> >>> @@ -1976,10 +1981,6 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
> >>>           return ram_save_multifd_page(block, offset);
> >>>       }
> >>>   
> >>> -    if (control_save_page(pss, offset, &res)) {
> >>> -        return res;
> >>> -    }
> >>> -
> >>>       return ram_save_page(rs, pss);
> >>>   }
> >>
> >
Zhijian Li (Fujitsu) Feb. 20, 2025, 1:21 a.m. UTC | #5
On 19/02/2025 21:23, Peter Xu wrote:
>> I tried to kill RAM_SAVE_CONTROL_NOT_SUPP, but It seems it doesn't need to touch any postcopy logic
>> "in the QMP migrate / migrate_incoming cmd, at migration_channels_and_transport_compatible()"
>>
>> Is there something I might have overlooked?
> Yes it looks almost good.  What I meant is (please see below):
> 
>> A whole draft diff would be like below:
>> It includes 3 parts:
>>
>> migration/rdma: Remove unnecessary RAM_SAVE_CONTROL_NOT_SUPP check in rdma_control_save_page()
>> migration: kill RAM_SAVE_CONTROL_NOT_SUPP
>> migration: open control_save_page() to ram_save_target_page()
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 589b6505eb2..fc6a964fd64 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1143,32 +1143,6 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
>>        return len;
>>    }
>>    
>> -/*
>> - * @pages: the number of pages written by the control path,
>> - *        < 0 - error
>> - *        > 0 - number of pages written
>> - *
>> - * Return true if the pages has been saved, otherwise false is returned.
>> - */
>> -static bool control_save_page(PageSearchStatus *pss,
>> -                              ram_addr_t offset, int *pages)
>> -{
>> -    int ret;
>> -
>> -    ret = rdma_control_save_page(pss->pss_channel, pss->block->offset, offset,
>> -                                 TARGET_PAGE_SIZE);
>> -    if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
>> -        return false;
>> -    }
>> -
>> -    if (ret == RAM_SAVE_CONTROL_DELAYED) {
>> -        *pages = 1;
>> -        return true;
>> -    }
>> -    *pages = ret;
>> -    return true;
>> -}
>> -
>>    /*
>>     * directly send the page to the stream
>>     *
>> @@ -1964,6 +1938,16 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
>>        ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
>>        int res;
>>    
>> +    if (migrate_rdma() && !migration_in_postcopy()) {
> Here instead of bypassing postcopy, we should fail the migrate cmd early if
> postcopy ever enabled:
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 862f469ea7..3a82e71437 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -257,6 +257,12 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
>           return false;
>       }
>   
> +    if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE &&
> +        migrate_postcopy_ram()) {

I think there is a typo
s/MIGRATION_ADDRESS_TYPE_FILE/MIGRATION_ADDRESS_TYPE_RDMA


> +        error_setg(errp, "RDMA migration doesn't support postcopy");

IIUC, your change means RDMA + postcopy is no longer supported. I didn't realize this before.
Additionally, we might consider eliminating all remaining `migration_in_postcopy()` conditions in the current `rdma.c` file.

Thanks
Zhijian

> +        return false;
> +    }
> +
>       return true;
>   }
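
The early rejection discussed here, with the transport type corrected to RDMA as suggested above, can be sketched as a standalone check. This is a toy model under stated assumptions: `ToyTransport` and `toy_transport_compatible` are hypothetical names, and the error is returned through a plain string pointer rather than QEMU's Error API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transport kinds; they only model the address types. */
typedef enum {
    TOY_TRANSPORT_SOCKET,
    TOY_TRANSPORT_FILE,
    TOY_TRANSPORT_RDMA
} ToyTransport;

/*
 * Reject the RDMA + postcopy combination before migration starts, so the
 * page-saving code never has to test for postcopy on the RDMA path.
 * On failure *errmsg points at a static message; on success it is NULL.
 */
static bool toy_transport_compatible(ToyTransport transport,
                                     bool postcopy_ram,
                                     const char **errmsg)
{
    if (transport == TOY_TRANSPORT_RDMA && postcopy_ram) {
        *errmsg = "RDMA migration doesn't support postcopy";
        return false;
    }

    *errmsg = NULL;
    return true;
}
```

In this model only the RDMA transport is affected; file and socket transports with postcopy enabled pass this particular check unchanged.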
Patch

diff --git a/migration/ram.c b/migration/ram.c
index 6f460fd22d2..635a2fe443a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1964,6 +1964,11 @@  static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
     ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
     int res;
 
+    /* Hand over to RDMA first */
+    if (control_save_page(pss, offset, &res)) {
+        return res;
+    }
+
     if (!migrate_multifd()
         || migrate_zero_page_detection() == ZERO_PAGE_DETECTION_LEGACY) {
         if (save_zero_page(rs, pss, offset)) {
@@ -1976,10 +1981,6 @@  static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
         return ram_save_multifd_page(block, offset);
     }
 
-    if (control_save_page(pss, offset, &res)) {
-        return res;
-    }
-
     return ram_save_page(rs, pss);
 }