
NVDIMM live migration broken?

Message ID 20170627143001.jsfmxqnaig7nfawx@hz-desktop (mailing list archive)
State New, archived

Commit Message

Haozhong Zhang June 27, 2017, 2:30 p.m. UTC
On 06/26/17 13:56 +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 26, 2017 at 10:05:01AM +0800, Haozhong Zhang wrote:
> > On 06/23/17 10:55 +0100, Stefan Hajnoczi wrote:
> > > On Fri, Jun 23, 2017 at 08:13:13AM +0800, haozhong.zhang@intel.com wrote:
> > > > On 06/22/17 15:08 +0100, Stefan Hajnoczi wrote:
> > > > > I tried live migrating a guest with NVDIMM on qemu.git/master (edf8bc984):
> > > > > 
> > > > >   $ qemu -M accel=kvm,nvdimm=on -m 1G,slots=4,maxmem=8G -cpu host \
> > > > >          -object memory-backend-file,id=mem1,share=on,mem-path=nvdimm.dat,size=1G \
> > > > > 	 -device nvdimm,id=nvdimm1,memdev=mem1 \
> > > > > 	 -drive if=virtio,file=test.img,format=raw
> > > > > 
> > > > >   $ qemu -M accel=kvm,nvdimm=on -m 1G,slots=4,maxmem=8G -cpu host \
> > > > >          -object memory-backend-file,id=mem1,share=on,mem-path=nvdimm.dat,size=1G \
> > > > > 	 -device nvdimm,id=nvdimm1,memdev=mem1 \
> > > > > 	 -drive if=virtio,file=test.img,format=raw \
> > > > > 	 -incoming tcp::1234
> > > > > 
> > > > >   (qemu) migrate tcp:127.0.0.1:1234
> > > > > 
> > > > > The guest kernel panics or hangs every time on the destination.  It
> > > > > happens as long as the nvdimm device is present - I didn't even mount it
> > > > > inside the guest.
> > > > > 
> > > > > Is migration expected to work?
> > > > 
> > > > Yes, I tested on QEMU 2.8.0 several months ago and it worked. I'll
> > > > have a look at this issue.
> > > 
> > > Great, thanks!
> > > 
> > > David Gilbert suggested the following on IRC; it sounds like a good
> > > starting point for debugging:
> > > 
> > > Launch the destination QEMU with -S (vcpus will be paused) and after
> > > migration has completed, compare the NVDIMM contents on source and
> > > destination.
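
One way to do that comparison, once the NVDIMM contents have been dumped
to a file on each side (for instance with the monitor's pmemsave command
over the NVDIMM's guest-physical range; the addresses and file names here
are placeholders, not from the thread), is a page-wise compare. A minimal
standalone sketch:

  /* cmp_pages.c -- report the first 4 KiB page at which two memory
   * dumps differ.  Build: cc -o cmp_pages cmp_pages.c
   * Usage: ./cmp_pages src-nvdimm.dump dst-nvdimm.dump */
  #include <stdio.h>
  #include <string.h>

  #define PAGE_SIZE 4096

  int main(int argc, char **argv)
  {
      if (argc != 3) {
          fprintf(stderr, "usage: %s <src.dump> <dst.dump>\n", argv[0]);
          return 1;
      }
      FILE *a = fopen(argv[1], "rb");
      FILE *b = fopen(argv[2], "rb");
      if (!a || !b) {
          perror("fopen");
          return 1;
      }
      unsigned char pa[PAGE_SIZE], pb[PAGE_SIZE];
      for (unsigned long page = 0; ; page++) {
          size_t na = fread(pa, 1, PAGE_SIZE, a);
          size_t nb = fread(pb, 1, PAGE_SIZE, b);
          if (na != nb || memcmp(pa, pb, na)) {
              printf("dumps differ at page %lu (offset 0x%lx)\n",
                     page, page * (unsigned long)PAGE_SIZE);
              return 2;   /* first divergent page found */
          }
          if (na == 0) {
              break;      /* both dumps ended at the same point */
          }
      }
      printf("dumps are identical\n");
      return 0;
  }
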
> > > 
> > 
> > Which host and guest kernels are you testing? Is any workload running
> > in the guest during migration?
> > 
> > I just tested QEMU commit edf8bc984 with host/guest kernel 4.8.0, and
> > could not reproduce the issue.
> 
> I can still reproduce the problem on qemu.git edf8bc984.
> 
> My guest kernel is fairly close to yours.  The host kernel is newer.
> 
> Host kernel: 4.11.6-201.fc25.x86_64
> Guest kernel: 4.8.8-300.fc25.x86_64
> 
> Command-line:
> 
>   qemu-system-x86_64 \
>       -enable-kvm \
>       -cpu host \
>       -machine pc,nvdimm \
>       -m 1G,slots=4,maxmem=8G \
>       -object memory-backend-file,id=mem1,share=on,mem-path=nvdimm.dat,size=1G \
>       -device nvdimm,id=nvdimm1,memdev=mem1 \
>       -drive if=virtio,file=test.img,format=raw \
>       -display none \
>       -serial stdio \
>       -monitor unix:/tmp/monitor.sock,server,nowait
> 
> Start migration at the guest login prompt.  You don't need to log in or
> do anything inside the guest.
> 
> There seems to be guest RAM corruption because I get different
> backtraces inside the guest every time.
> 
> The problem goes away if I remove -device nvdimm.
> 

I managed to reproduce this bug. After bisecting between good v2.8.0 and
bad edf8bc984, it looks like a regression introduced by
    6b6712efccd "ram: Split dirty bitmap by RAMBlock"
This commit may result in a guest crash after migration if any host
memory backend is used.

Could you test whether the attached draft patch fixes this bug? If yes,
I will make a formal patch later.

Thanks,
Haozhong

Comments

Juan Quintela June 27, 2017, 4:58 p.m. UTC | #1
Haozhong Zhang <haozhong.zhang@intel.com> wrote:

....

Hi

I am trying to see what is going on.

>> 
>
> I managed to reproduce this bug. After bisecting between good v2.8.0 and
> bad edf8bc984, it looks like a regression introduced by
>     6b6712efccd "ram: Split dirty bitmap by RAMBlock"
> This commit may result in a guest crash after migration if any host
> memory backend is used.
>
> Could you test whether the attached draft patch fixes this bug? If yes,
> I will make a formal patch later.
>
> Thanks,
> Haozhong
>
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index 73d1bea8b6..2ae4ff3965 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -377,7 +377,9 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>                                                 uint64_t *real_dirty_pages)
>  {
>      ram_addr_t addr;
> +    ram_addr_t offset = rb->offset;
>      unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
> +    unsigned long dirty_page = BIT_WORD((start + offset) >> TARGET_PAGE_BITS);
>      uint64_t num_dirty = 0;
>      unsigned long *dest = rb->bmap;
>  


If this is the case, I can't understand how it ever worked :-(

Investigating.

Later, Juan.

> @@ -386,8 +388,9 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>          int k;
>          int nr = BITS_TO_LONGS(length >> TARGET_PAGE_BITS);
>          unsigned long * const *src;
> -        unsigned long idx = (page * BITS_PER_LONG) / DIRTY_MEMORY_BLOCK_SIZE;
> -        unsigned long offset = BIT_WORD((page * BITS_PER_LONG) %
> +        unsigned long idx = (dirty_page * BITS_PER_LONG) /
> +                            DIRTY_MEMORY_BLOCK_SIZE;
> +        unsigned long offset = BIT_WORD((dirty_page * BITS_PER_LONG) %
>                                          DIRTY_MEMORY_BLOCK_SIZE);
>  
>          rcu_read_lock();
> @@ -416,7 +419,7 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>      } else {
>          for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
>              if (cpu_physical_memory_test_and_clear_dirty(
> -                        start + addr,
> +                        start + addr + offset,
>                          TARGET_PAGE_SIZE,
>                          DIRTY_MEMORY_MIGRATION)) {
>                  *real_dirty_pages += 1;
Juan Quintela June 27, 2017, 6:12 p.m. UTC | #2
Juan Quintela <quintela@redhat.com> wrote:
> Haozhong Zhang <haozhong.zhang@intel.com> wrote:
>
> ....
>
> Hi
>
> I am trying to see what is going on.
>
>>> 
>>
>> I managed to reproduce this bug. After bisecting between good v2.8.0 and
>> bad edf8bc984, it looks like a regression introduced by
>>     6b6712efccd "ram: Split dirty bitmap by RAMBlock"
>> This commit may result in a guest crash after migration if any host
>> memory backend is used.
>>
>> Could you test whether the attached draft patch fixes this bug? If yes,
>> I will make a formal patch later.
>>
>> Thanks,
>> Haozhong
>>
>> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
>> index 73d1bea8b6..2ae4ff3965 100644
>> --- a/include/exec/ram_addr.h
>> +++ b/include/exec/ram_addr.h
>> @@ -377,7 +377,9 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>>                                                 uint64_t *real_dirty_pages)
>>  {
>>      ram_addr_t addr;
>> +    ram_addr_t offset = rb->offset;
>>      unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
>> +    unsigned long dirty_page = BIT_WORD((start + offset) >> TARGET_PAGE_BITS);
>>      uint64_t num_dirty = 0;
>>      unsigned long *dest = rb->bmap;
>>  
>
>
> If this is the case, I can't understand how it ever worked :-(
>
> Investigating.

Further investigation shows what is going on:
- pc.ram, by default, is the first RAMBlock
- so its offset == 0
- the rest of the devices do not live in RAM

So it happened to work.

Only RAM ends up using that function, so we never noticed.

When we use an nvdimm device (I don't know if any other device behaves
the same), pc.ram is pushed out of RAMBlock 0, and then the offset
becomes important (see the tables and the sketch below).

# No NVDIMM

(qemu) info ramblock 
              Block Name    PSize              Offset               Used              Total
                  pc.ram    4 KiB  0x0000000000000000 0x0000000040000000 0x0000000040000000
                vga.vram    4 KiB  0x0000000040060000 0x0000000000400000 0x0000000000400000

# with NVDIMM

(qemu) info ramblock 
              Block Name    PSize              Offset               Used              Total
           /objects/mem1    4 KiB  0x0000000000000000 0x0000000040000000 0x0000000040000000
                  pc.ram    4 KiB  0x0000000040000000 0x0000000040000000 0x0000000040000000
                vga.vram    4 KiB  0x0000000080060000 0x0000000000400000 0x0000000000400000
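
To see what that means for the dirty bitmap, here is a small standalone
sketch (not QEMU code; it assumes 4 KiB target pages, i.e.
TARGET_PAGE_BITS == 12) that computes the first global bitmap page for
pc.ram both ways, using the offsets from the tables above:

  /* offset_demo.c -- the dirty bitmap is indexed by global ram_addr_t,
   * so a block-relative 'start' must have rb->offset added first. */
  #include <stdio.h>

  #define TARGET_PAGE_BITS 12

  int main(void)
  {
      unsigned long long start = 0;             /* sync starts at block-relative 0 */
      unsigned long long pc_ram_plain  = 0x0ULL;        /* pc.ram offset, no NVDIMM   */
      unsigned long long pc_ram_nvdimm = 0x40000000ULL; /* pc.ram offset, with NVDIMM */

      /* Pre-patch code used the block-relative index ... */
      printf("block-relative first page:       %llu\n",
             start >> TARGET_PAGE_BITS);
      /* ... which happens to be correct while pc.ram sits at offset 0 ... */
      printf("global first page (no NVDIMM):   %llu\n",
             (start + pc_ram_plain) >> TARGET_PAGE_BITS);
      /* ... but with NVDIMM the same lookup lands on /objects/mem1's pages. */
      printf("global first page (with NVDIMM): %llu\n",
             (start + pc_ram_nvdimm) >> TARGET_PAGE_BITS);
      return 0;
  }

With NVDIMM present, pc.ram's dirty bits actually live at global pages
262144..524287, but the pre-patch code read (and cleared) pages
0..262143 instead -- the range that belongs to /objects/mem1.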


I am still amused/confused/intrigued about how we hadn't discovered the
problem before.

The patch fixes the problem described in this thread.


Later, Juan.

>
> Later, Juan.
>
>> @@ -386,8 +388,9 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>>          int k;
>>          int nr = BITS_TO_LONGS(length >> TARGET_PAGE_BITS);
>>          unsigned long * const *src;
>> -        unsigned long idx = (page * BITS_PER_LONG) / DIRTY_MEMORY_BLOCK_SIZE;
>> -        unsigned long offset = BIT_WORD((page * BITS_PER_LONG) %
>> +        unsigned long idx = (dirty_page * BITS_PER_LONG) /
>> +                            DIRTY_MEMORY_BLOCK_SIZE;
>> +        unsigned long offset = BIT_WORD((dirty_page * BITS_PER_LONG) %
>>                                          DIRTY_MEMORY_BLOCK_SIZE);
>>  
>>          rcu_read_lock();
>> @@ -416,7 +419,7 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>>      } else {
>>          for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
>>              if (cpu_physical_memory_test_and_clear_dirty(
>> -                        start + addr,
>> +                        start + addr + offset,
>>                          TARGET_PAGE_SIZE,
>>                          DIRTY_MEMORY_MIGRATION)) {
>>                  *real_dirty_pages += 1;
Stefan Hajnoczi June 28, 2017, 10:05 a.m. UTC | #3
On Tue, Jun 27, 2017 at 10:30:01PM +0800, Haozhong Zhang wrote:
> On 06/26/17 13:56 +0100, Stefan Hajnoczi wrote:
> > On Mon, Jun 26, 2017 at 10:05:01AM +0800, Haozhong Zhang wrote:
> > > On 06/23/17 10:55 +0100, Stefan Hajnoczi wrote:
> > > > On Fri, Jun 23, 2017 at 08:13:13AM +0800, haozhong.zhang@intel.com wrote:
> > > > > On 06/22/17 15:08 +0100, Stefan Hajnoczi wrote:
> > > > > > I tried live migrating a guest with NVDIMM on qemu.git/master (edf8bc984):
> > > > > > 
> > > > > >   $ qemu -M accel=kvm,nvdimm=on -m 1G,slots=4,maxmem=8G -cpu host \
> > > > > >          -object memory-backend-file,id=mem1,share=on,mem-path=nvdimm.dat,size=1G \
> > > > > > 	 -device nvdimm,id=nvdimm1,memdev=mem1 \
> > > > > > 	 -drive if=virtio,file=test.img,format=raw
> > > > > > 
> > > > > >   $ qemu -M accel=kvm,nvdimm=on -m 1G,slots=4,maxmem=8G -cpu host \
> > > > > >          -object memory-backend-file,id=mem1,share=on,mem-path=nvdimm.dat,size=1G \
> > > > > > 	 -device nvdimm,id=nvdimm1,memdev=mem1 \
> > > > > > 	 -drive if=virtio,file=test.img,format=raw \
> > > > > > 	 -incoming tcp::1234
> > > > > > 
> > > > > >   (qemu) migrate tcp:127.0.0.1:1234
> > > > > > 
> > > > > > The guest kernel panics or hangs every time on the destination.  It
> > > > > > happens as long as the nvdimm device is present - I didn't even mount it
> > > > > > inside the guest.
> > > > > > 
> > > > > > Is migration expected to work?
> > > > > 
> > > > > Yes, I tested on QEMU 2.8.0 several months ago and it worked. I'll
> > > > > have a look at this issue.
> > > > 
> > > > Great, thanks!
> > > > 
> > > > David Gilbert suggested the following on IRC; it sounds like a good
> > > > starting point for debugging:
> > > > 
> > > > Launch the destination QEMU with -S (vcpus will be paused) and after
> > > > migration has completed, compare the NVDIMM contents on source and
> > > > destination.
> > > > 
> > > 
> > > Which host and guest kernels are you testing? Is any workload running
> > > in the guest during migration?
> > > 
> > > I just tested QEMU commit edf8bc984 with host/guest kernel 4.8.0, and
> > > could not reproduce the issue.
> > 
> > I can still reproduce the problem on qemu.git edf8bc984.
> > 
> > My guest kernel is fairly close to yours.  The host kernel is newer.
> > 
> > Host kernel: 4.11.6-201.fc25.x86_64
> > Guest kernel: 4.8.8-300.fc25.x86_64
> > 
> > Command-line:
> > 
> >   qemu-system-x86_64 \
> >       -enable-kvm \
> >       -cpu host \
> >       -machine pc,nvdimm \
> >       -m 1G,slots=4,maxmem=8G \
> >       -object memory-backend-file,id=mem1,share=on,mem-path=nvdimm.dat,size=1G \
> >       -device nvdimm,id=nvdimm1,memdev=mem1 \
> >       -drive if=virtio,file=test.img,format=raw \
> >       -display none \
> >       -serial stdio \
> >       -monitor unix:/tmp/monitor.sock,server,nowait
> > 
> > Start migration at the guest login prompt.  You don't need to log in or
> > do anything inside the guest.
> > 
> > There seems to be guest RAM corruption because I get different
> > backtraces inside the guest every time.
> > 
> > The problem goes away if I remove -device nvdimm.
> > 
> 
> I managed to reproduce this bug. After bisecting between good v2.8.0 and
> bad edf8bc984, it looks like a regression introduced by
>     6b6712efccd "ram: Split dirty bitmap by RAMBlock"
> This commit may result in a guest crash after migration if any host
> memory backend is used.
> 
> Could you test whether the attached draft patch fixes this bug? If yes,
> I will make a formal patch later.

Thanks for the fix!  I tested and replied to your v2 patch.

Stefan

Patch

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 73d1bea8b6..2ae4ff3965 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -377,7 +377,9 @@  uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
                                                uint64_t *real_dirty_pages)
 {
     ram_addr_t addr;
+    ram_addr_t offset = rb->offset;
     unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
+    unsigned long dirty_page = BIT_WORD((start + offset) >> TARGET_PAGE_BITS);
     uint64_t num_dirty = 0;
     unsigned long *dest = rb->bmap;
 
@@ -386,8 +388,9 @@  uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
         int k;
         int nr = BITS_TO_LONGS(length >> TARGET_PAGE_BITS);
         unsigned long * const *src;
-        unsigned long idx = (page * BITS_PER_LONG) / DIRTY_MEMORY_BLOCK_SIZE;
-        unsigned long offset = BIT_WORD((page * BITS_PER_LONG) %
+        unsigned long idx = (dirty_page * BITS_PER_LONG) /
+                            DIRTY_MEMORY_BLOCK_SIZE;
+        unsigned long offset = BIT_WORD((dirty_page * BITS_PER_LONG) %
                                         DIRTY_MEMORY_BLOCK_SIZE);
 
         rcu_read_lock();
@@ -416,7 +419,7 @@  uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
     } else {
         for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
             if (cpu_physical_memory_test_and_clear_dirty(
-                        start + addr,
+                        start + addr + offset,
                         TARGET_PAGE_SIZE,
                         DIRTY_MEMORY_MIGRATION)) {
                 *real_dirty_pages += 1;
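
For reference, the same arithmetic at the level the patch actually
changes -- a standalone sketch with the relevant QEMU constants
reproduced locally (a 64-bit host is assumed, and the
DIRTY_MEMORY_BLOCK_SIZE value is taken from QEMU source of this period,
so double-check it against your tree):

  /* idx_demo.c -- word/block indices into the global dirty bitmap for
   * pc.ram when an NVDIMM pushes it to offset 0x40000000. */
  #include <stdio.h>

  #define TARGET_PAGE_BITS        12
  #define BITS_PER_LONG           64               /* 64-bit host assumed */
  #define BIT_WORD(nr)            ((nr) / BITS_PER_LONG)
  #define DIRTY_MEMORY_BLOCK_SIZE (256UL * 1024 * 8)

  int main(void)
  {
      unsigned long start = 0;                /* block-relative sync start */
      unsigned long rb_offset = 0x40000000UL; /* pc.ram offset with NVDIMM */

      /* Pre-patch: index derived from the block-relative address. */
      unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
      /* Post-patch: index derived from the global ram_addr_t. */
      unsigned long dirty_page = BIT_WORD((start + rb_offset) >> TARGET_PAGE_BITS);

      printf("pre-patch:  idx=%lu offset=%lu\n",
             (page * BITS_PER_LONG) / DIRTY_MEMORY_BLOCK_SIZE,
             BIT_WORD((page * BITS_PER_LONG) % DIRTY_MEMORY_BLOCK_SIZE));
      printf("post-patch: idx=%lu offset=%lu\n",
             (dirty_page * BITS_PER_LONG) / DIRTY_MEMORY_BLOCK_SIZE,
             BIT_WORD((dirty_page * BITS_PER_LONG) % DIRTY_MEMORY_BLOCK_SIZE));
      return 0;
  }

With these numbers the patched code starts reading the bitmap at word
4096 of block 0 instead of word 0 -- that is, 0x40000000 bytes further
in, at pc.ram's real position in guest address space.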