diff mbox series

[kvm-unit-tests] lib/arm/io: Fix calling getchar() multiple times

Message ID 20240216140210.70280-1-thuth@redhat.com (mailing list archive)
State New, archived
Headers show
Series [kvm-unit-tests] lib/arm/io: Fix calling getchar() multiple times | expand

Commit Message

Thomas Huth Feb. 16, 2024, 2:02 p.m. UTC
getchar() can currently only be called once on arm since the implementation
is a little bit too  naïve: After the first character has arrived, the
data register never gets set to zero again. To properly check whether a
byte is available, we need to check the "RX fifo empty" on the pl011 UART
or the "RX data ready" bit on the ns16550a UART instead.

With this proper check in place, we can finally also get rid of the
ugly assert(count < 16) statement here.

Signed-off-by: Thomas Huth <thuth@redhat.com>
---
 lib/arm/io.c | 34 ++++++++++++++--------------------
 1 file changed, 14 insertions(+), 20 deletions(-)

Comments

Nicholas Piggin Feb. 17, 2024, 10:43 a.m. UTC | #1
On Sat Feb 17, 2024 at 12:02 AM AEST, Thomas Huth wrote:
> getchar() can currently only be called once on arm since the implementation
> is a little bit too  naïve: After the first character has arrived, the
> data register never gets set to zero again. To properly check whether a
> byte is available, we need to check the "RX fifo empty" on the pl011 UART
> or the "RX data ready" bit on the ns16550a UART instead.
>
> With this proper check in place, we can finally also get rid of the
> ugly assert(count < 16) statement here.
>
> Signed-off-by: Thomas Huth <thuth@redhat.com>

Nice, thanks for fixing this up.

I see what you mean about multi-migration not waiting. It seems
to be an arm issue, ppc works properly. This patch changed things
so it works a bit better (or at least differently) now, but
still has some bugs. Maybe buggy uart migration?

Thanks,
Nick
Nicholas Piggin Feb. 17, 2024, 2:28 p.m. UTC | #2
On Sat Feb 17, 2024 at 12:02 AM AEST, Thomas Huth wrote:
> getchar() can currently only be called once on arm since the implementation
> is a little bit too  naïve: After the first character has arrived, the
> data register never gets set to zero again. To properly check whether a
> byte is available, we need to check the "RX fifo empty" on the pl011 UART
> or the "RX data ready" bit on the ns16550a UART instead.
>
> With this proper check in place, we can finally also get rid of the
> ugly assert(count < 16) statement here.
>

Thanks, this seems to work well. But arm64 still behaves strangely
with QEMU migration. It seems some odd corruption around __getchar,
but this turns out not to have anything to do with the MMIO read,
but it's just because the test case generally is spinning there
when a migration happens.

I modified __getchar() to the following, which removes the MMIO
entirely and just waits a number of times called before returning
a getchar, chosen to be enough time for QEMU migration to complete
on my system...

int __getchar(void)
{
        static int i;
        static volatile int locked;
        int ret = -1;

        assert(!locked);
        locked = 1;
        assert(locked);
        cpu_relax();
        assert(locked);

        i++;
        if (i >= 3000000) {
                i = 0;
                ret = 1;
        }

        assert(locked);
        locked = 0;
        assert(!locked);

        return ret;
}

I get asserts on the line right after cpu_relax() on arm64 after a
few migrations on migration selftest.

Without the cpu_relax() it takes longer to hit an assert and it's
usually the first one. I expect this is related to TCG translation
blocks and where it exits back to be migrated. Something is very
strange, I suspect it's QEMU migration bug.

Thanks,
Nick
Thomas Huth Feb. 19, 2024, 6:59 a.m. UTC | #3
On 17/02/2024 11.43, Nicholas Piggin wrote:
> On Sat Feb 17, 2024 at 12:02 AM AEST, Thomas Huth wrote:
>> getchar() can currently only be called once on arm since the implementation
>> is a little bit too  naïve: After the first character has arrived, the
>> data register never gets set to zero again. To properly check whether a
>> byte is available, we need to check the "RX fifo empty" on the pl011 UART
>> or the "RX data ready" bit on the ns16550a UART instead.
>>
>> With this proper check in place, we can finally also get rid of the
>> ugly assert(count < 16) statement here.
>>
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
> 
> Nice, thanks for fixing this up.
> 
> I see what you mean about multi-migration not waiting. It seems
> to be an arm issue, ppc works properly.

Yes, it's an arm issue. s390x also works fine.

> This patch changed things
> so it works a bit better (or at least differently) now, but
> still has some bugs. Maybe buggy uart migration?

I'm also seeing hangs when running the arm migration-test multiple times, 
but also without my UART patch here - so I assume the problem is not really 
related to the UART?

  Thomas
Nicholas Piggin Feb. 19, 2024, 11:58 a.m. UTC | #4
On Mon Feb 19, 2024 at 4:59 PM AEST, Thomas Huth wrote:
> On 17/02/2024 11.43, Nicholas Piggin wrote:
> > On Sat Feb 17, 2024 at 12:02 AM AEST, Thomas Huth wrote:
> >> getchar() can currently only be called once on arm since the implementation
> >> is a little bit too  naïve: After the first character has arrived, the
> >> data register never gets set to zero again. To properly check whether a
> >> byte is available, we need to check the "RX fifo empty" on the pl011 UART
> >> or the "RX data ready" bit on the ns16550a UART instead.
> >>
> >> With this proper check in place, we can finally also get rid of the
> >> ugly assert(count < 16) statement here.
> >>
> >> Signed-off-by: Thomas Huth <thuth@redhat.com>
> > 
> > Nice, thanks for fixing this up.
> > 
> > I see what you mean about multi-migration not waiting. It seems
> > to be an arm issue, ppc works properly.
>
> Yes, it's an arm issue. s390x also works fine.
>
> > This patch changed things
> > so it works a bit better (or at least differently) now, but
> > still has some bugs. Maybe buggy uart migration?
>
> I'm also seeing hangs when running the arm migration-test multiple times, 
> but also without my UART patch here - so I assume the problem is not really 
> related to the UART?

Yeah, I ended up figuring it out. A 11 year old TCG migration memory
corruption bug!

https://lists.gnu.org/archive/html/qemu-devel/2024-02/msg03486.html

All the weirdness was just symptoms of that. The hang that arm usually
got was target machine trying to lock the uart spinlock that is already
locked (because the unlock store got lost in migration).

powerpc and s390x were just luckier in avoiding the race, maybe the way
their translation blocks around getchar code were constructed made the
problem not show up easily or at all. I did end up causing problems
for them by rearranging the code (test case is linked in that msg).

Thanks,
Nick
Andrew Jones Feb. 19, 2024, 1:46 p.m. UTC | #5
On Mon, Feb 19, 2024 at 09:58:20PM +1000, Nicholas Piggin wrote:
> On Mon Feb 19, 2024 at 4:59 PM AEST, Thomas Huth wrote:
> > On 17/02/2024 11.43, Nicholas Piggin wrote:
> > > On Sat Feb 17, 2024 at 12:02 AM AEST, Thomas Huth wrote:
> > >> getchar() can currently only be called once on arm since the implementation
> > >> is a little bit too  naïve: After the first character has arrived, the
> > >> data register never gets set to zero again. To properly check whether a
> > >> byte is available, we need to check the "RX fifo empty" on the pl011 UART
> > >> or the "RX data ready" bit on the ns16550a UART instead.
> > >>
> > >> With this proper check in place, we can finally also get rid of the
> > >> ugly assert(count < 16) statement here.
> > >>
> > >> Signed-off-by: Thomas Huth <thuth@redhat.com>
> > > 
> > > Nice, thanks for fixing this up.
> > > 
> > > I see what you mean about multi-migration not waiting. It seems
> > > to be an arm issue, ppc works properly.
> >
> > Yes, it's an arm issue. s390x also works fine.
> >
> > > This patch changed things
> > > so it works a bit better (or at least differently) now, but
> > > still has some bugs. Maybe buggy uart migration?
> >
> > I'm also seeing hangs when running the arm migration-test multiple times, 
> > but also without my UART patch here - so I assume the problem is not really 
> > related to the UART?
> 
> Yeah, I ended up figuring it out. A 11 year old TCG migration memory
> corruption bug!
> 
> https://lists.gnu.org/archive/html/qemu-devel/2024-02/msg03486.html

Nice! And thanks for bringing this multi-migration test support to
kvm-unit-tests!

drew

> 
> All the weirdness was just symptoms of that. The hang that arm usually
> got was target machine trying to lock the uart spinlock that is already
> locked (because the unlock store got lost in migration).
> 
> powerpc and s390x were just luckier in avoiding the race, maybe the way
> their translation blocks around getchar code were constructed made the
> problem not show up easily or at all. I did end up causing problems
> for them by rearranging the code (test case is linked in that msg).
> 
> Thanks,
> Nick
Andrew Jones Feb. 19, 2024, 2:22 p.m. UTC | #6
On Fri, Feb 16, 2024 at 03:02:10PM +0100, Thomas Huth wrote:
> getchar() can currently only be called once on arm since the implementation
> is a little bit too  naïve: After the first character has arrived, the
> data register never gets set to zero again. To properly check whether a
> byte is available, we need to check the "RX fifo empty" on the pl011 UART
> or the "RX data ready" bit on the ns16550a UART instead.
> 
> With this proper check in place, we can finally also get rid of the
> ugly assert(count < 16) statement here.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  lib/arm/io.c | 34 ++++++++++++++--------------------
>  1 file changed, 14 insertions(+), 20 deletions(-)
> 
> diff --git a/lib/arm/io.c b/lib/arm/io.c
> index c15e57c4..836fa854 100644
> --- a/lib/arm/io.c
> +++ b/lib/arm/io.c
> @@ -28,6 +28,7 @@ static struct spinlock uart_lock;
>   */
>  #define UART_EARLY_BASE (u8 *)(unsigned long)CONFIG_UART_EARLY_BASE
>  static volatile u8 *uart0_base = UART_EARLY_BASE;
> +bool is_pl011_uart;
>  
>  static void uart0_init_fdt(void)
>  {
> @@ -59,7 +60,10 @@ static void uart0_init_fdt(void)
>  			abort();
>  		}
>  
> +		is_pl011_uart = (i == 0);
>  	} else {
> +		is_pl011_uart = !fdt_node_check_compatible(dt_fdt(), ret,
> +		                                           "arm,pl011");
>  		ret = dt_pbus_translate_node(ret, 0, &base);
>  		assert(ret == 0);
>  	}
> @@ -111,31 +115,21 @@ void puts(const char *s)
>  	spin_unlock(&uart_lock);
>  }
>  
> -static int do_getchar(void)
> +int __getchar(void)
>  {
> -	int c;
> +	int c = -1;
>  
>  	spin_lock(&uart_lock);
> -	c = readb(uart0_base);
> -	spin_unlock(&uart_lock);
> -
> -	return c ?: -1;
> -}
> -
> -/*
> - * Minimalist implementation for migration completion detection.
> - * Without FIFOs enabled on the QEMU UART device we just read
> - * the data register: we cannot read more than 16 characters.
> - */
> -int __getchar(void)
> -{
> -	int c = do_getchar();
> -	static int count;
>  
> -	if (c != -1)
> -		++count;
> +	if (is_pl011_uart) {
> +		if (!(readb(uart0_base + 6 * 4) & 0x10))  /* RX not empty? */
> +			c = readb(uart0_base);
> +	} else {
> +		if (readb(uart0_base + 5) & 0x01)         /* RX data ready? */
> +			c = readb(uart0_base);
> +	}

I think I'd eventually prefer encapsulating these separate implementations
into a console device, but that might be a bit heavy handed for a fix that
we want to pull into the multi-migration series.

Thanks,
drew
Alexandru Elisei Feb. 19, 2024, 4:56 p.m. UTC | #7
Hi,

Thanks for writing this. I've tested it with kvmtool, which emulates a 8250
UART:

Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>

This fixes a longstanding bug with kvmtool, where migrate_once() would read
the last character that was sent, and then think that migration was
completed even though it was never performed.

While we are on the subject of migration:

SKIP: gicv3: its-migration: Test requires at least 4 vcpus
Now migrate the VM, then press a key to continue...
INFO: gicv3: its-migration: Migration complete
SUMMARY: 1 tests, 1 skipped

That's extremely confusing. Why is migrate_once() executed after the
test_its_pending() function call without checking if the test was skipped?

Nitpicks below.

On Fri, Feb 16, 2024 at 03:02:10PM +0100, Thomas Huth wrote:
> getchar() can currently only be called once on arm since the implementation
> is a little bit too  naïve: After the first character has arrived, the
> data register never gets set to zero again. To properly check whether a
> byte is available, we need to check the "RX fifo empty" on the pl011 UART
> or the "RX data ready" bit on the ns16550a UART instead.
> 
> With this proper check in place, we can finally also get rid of the
> ugly assert(count < 16) statement here.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  lib/arm/io.c | 34 ++++++++++++++--------------------
>  1 file changed, 14 insertions(+), 20 deletions(-)
> 
> diff --git a/lib/arm/io.c b/lib/arm/io.c
> index c15e57c4..836fa854 100644
> --- a/lib/arm/io.c
> +++ b/lib/arm/io.c
> @@ -28,6 +28,7 @@ static struct spinlock uart_lock;
>   */
>  #define UART_EARLY_BASE (u8 *)(unsigned long)CONFIG_UART_EARLY_BASE
>  static volatile u8 *uart0_base = UART_EARLY_BASE;
> +bool is_pl011_uart;
>  
>  static void uart0_init_fdt(void)
>  {
> @@ -59,7 +60,10 @@ static void uart0_init_fdt(void)
>  			abort();
>  		}
>  
> +		is_pl011_uart = (i == 0);
>  	} else {
> +		is_pl011_uart = !fdt_node_check_compatible(dt_fdt(), ret,
> +		                                           "arm,pl011");
>  		ret = dt_pbus_translate_node(ret, 0, &base);
>  		assert(ret == 0);
>  	}
> @@ -111,31 +115,21 @@ void puts(const char *s)
>  	spin_unlock(&uart_lock);
>  }
>  
> -static int do_getchar(void)
> +int __getchar(void)
>  {
> -	int c;
> +	int c = -1;
>  
>  	spin_lock(&uart_lock);
> -	c = readb(uart0_base);
> -	spin_unlock(&uart_lock);
> -
> -	return c ?: -1;
> -}
> -
> -/*
> - * Minimalist implementation for migration completion detection.
> - * Without FIFOs enabled on the QEMU UART device we just read
> - * the data register: we cannot read more than 16 characters.
> - */
> -int __getchar(void)
> -{
> -	int c = do_getchar();
> -	static int count;
>  
> -	if (c != -1)
> -		++count;
> +	if (is_pl011_uart) {
> +		if (!(readb(uart0_base + 6 * 4) & 0x10))  /* RX not empty? */

I think it would be useful if the magic numbers were replaced by something
less opaque, something like:

		if (!(readb(uart0_base + PL011_UARTFR) & PL011_UARTFR_RXFE))

> +			c = readb(uart0_base);
> +	} else {
> +		if (readb(uart0_base + 5) & 0x01)         /* RX data ready? */

Same as above, perhaps:

		if (readb(uart0_base + UART16550_LSR) & UART16550_LSR_DR)

Naming of course being subject to taste.

Thanks,
Alex

> +			c = readb(uart0_base);
> +	}
>  
> -	assert(count < 16);
> +	spin_unlock(&uart_lock);
>  
>  	return c;
>  }
> -- 
> 2.43.0
>
Nicholas Piggin Feb. 20, 2024, 1:37 a.m. UTC | #8
On Tue Feb 20, 2024 at 2:56 AM AEST, Alexandru Elisei wrote:
> Hi,
>
> Thanks for writing this. I've tested it with kvmtool, which emulates a 8250
> UART:
>
> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
>
> This fixes a longstanding bug with kvmtool, where migrate_once() would read
> the last character that was sent, and then think that migration was
> completed even though it was never performed.
>
> While we are on the subject of migration:
>
> SKIP: gicv3: its-migration: Test requires at least 4 vcpus
> Now migrate the VM, then press a key to continue...
> INFO: gicv3: its-migration: Migration complete
> SUMMARY: 1 tests, 1 skipped
>
> That's extremely confusing. Why is migrate_once() executed after the
> test_its_pending() function call without checking if the test was skipped?

Looks like it was done so the test is skipped without killing the harness
due to expected migration point not being reached.

After the multi-migration series, you could just put a migrate_quiet()
there as a quick fix.

But I'm thinking we could just remove the requirement for the harness
to see at least 1 migration point. The test itself knows how many
migrations it should perform, by how many times it calls migrate*().
It is somewhat a sanity test against test being invoked the wrong way
and not doing the migration, but how much is that worth...? Actually
we could have a new sideband message like "VM migration is skipped"
that doesn't do anything except tell the test harness to not fail
due to missing migration point. That gets the best of both worlds,
just needs tests to be updated.

Thanks,
Nick

>
> Nitpicks below.
>
> On Fri, Feb 16, 2024 at 03:02:10PM +0100, Thomas Huth wrote:
> > getchar() can currently only be called once on arm since the implementation
> > is a little bit too  naïve: After the first character has arrived, the
> > data register never gets set to zero again. To properly check whether a
> > byte is available, we need to check the "RX fifo empty" on the pl011 UART
> > or the "RX data ready" bit on the ns16550a UART instead.
> > 
> > With this proper check in place, we can finally also get rid of the
> > ugly assert(count < 16) statement here.
> > 
> > Signed-off-by: Thomas Huth <thuth@redhat.com>
> > ---
> >  lib/arm/io.c | 34 ++++++++++++++--------------------
> >  1 file changed, 14 insertions(+), 20 deletions(-)
> > 
> > diff --git a/lib/arm/io.c b/lib/arm/io.c
> > index c15e57c4..836fa854 100644
> > --- a/lib/arm/io.c
> > +++ b/lib/arm/io.c
> > @@ -28,6 +28,7 @@ static struct spinlock uart_lock;
> >   */
> >  #define UART_EARLY_BASE (u8 *)(unsigned long)CONFIG_UART_EARLY_BASE
> >  static volatile u8 *uart0_base = UART_EARLY_BASE;
> > +bool is_pl011_uart;
> >  
> >  static void uart0_init_fdt(void)
> >  {
> > @@ -59,7 +60,10 @@ static void uart0_init_fdt(void)
> >  			abort();
> >  		}
> >  
> > +		is_pl011_uart = (i == 0);
> >  	} else {
> > +		is_pl011_uart = !fdt_node_check_compatible(dt_fdt(), ret,
> > +		                                           "arm,pl011");
> >  		ret = dt_pbus_translate_node(ret, 0, &base);
> >  		assert(ret == 0);
> >  	}
> > @@ -111,31 +115,21 @@ void puts(const char *s)
> >  	spin_unlock(&uart_lock);
> >  }
> >  
> > -static int do_getchar(void)
> > +int __getchar(void)
> >  {
> > -	int c;
> > +	int c = -1;
> >  
> >  	spin_lock(&uart_lock);
> > -	c = readb(uart0_base);
> > -	spin_unlock(&uart_lock);
> > -
> > -	return c ?: -1;
> > -}
> > -
> > -/*
> > - * Minimalist implementation for migration completion detection.
> > - * Without FIFOs enabled on the QEMU UART device we just read
> > - * the data register: we cannot read more than 16 characters.
> > - */
> > -int __getchar(void)
> > -{
> > -	int c = do_getchar();
> > -	static int count;
> >  
> > -	if (c != -1)
> > -		++count;
> > +	if (is_pl011_uart) {
> > +		if (!(readb(uart0_base + 6 * 4) & 0x10))  /* RX not empty? */
>
> I think it would be useful if the magic numbers were replaced by something
> less opaque, something like:
>
> 		if (!(readb(uart0_base + PL011_UARTFR) & PL011_UARTFR_RXFE))
>
> > +			c = readb(uart0_base);
> > +	} else {
> > +		if (readb(uart0_base + 5) & 0x01)         /* RX data ready? */
>
> Same as above, perhaps:
>
> 		if (readb(uart0_base + UART16550_LSR) & UART16550_LSR_DR)
>
> Naming of course being subject to taste.
>
> Thanks,
> Alex
>
> > +			c = readb(uart0_base);
> > +	}
> >  
> > -	assert(count < 16);
> > +	spin_unlock(&uart_lock);
> >  
> >  	return c;
> >  }
> > -- 
> > 2.43.0
> >
Nicholas Piggin Feb. 20, 2024, 8:51 a.m. UTC | #9
On Tue Feb 20, 2024 at 2:56 AM AEST, Alexandru Elisei wrote:
> Hi,
>
> Thanks for writing this. I've tested it with kvmtool, which emulates a 8250
> UART:
>
> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
>
> This fixes a longstanding bug with kvmtool, where migrate_once() would read
> the last character that was sent, and then think that migration was
> completed even though it was never performed.
>
> While we are on the subject of migration:
>
> SKIP: gicv3: its-migration: Test requires at least 4 vcpus
> Now migrate the VM, then press a key to continue...
> INFO: gicv3: its-migration: Migration complete
> SUMMARY: 1 tests, 1 skipped
>
> That's extremely confusing. Why is migrate_once() executed after the
> test_its_pending() function call without checking if the test was skipped?

Seems not too hard, incremental patch on top of multi migration
series below. After this series is merged I can try that (s390
could benefit with some changes too).

Thanks,
Nick

---
diff --git a/arm/gic.c b/arm/gic.c
index c950b0d15..bbf828f17 100644
--- a/arm/gic.c
+++ b/arm/gic.c
@@ -782,13 +782,15 @@ static void test_its_migration(void)
 	struct its_device *dev2, *dev7;
 	cpumask_t mask;
 
-	if (its_setup1())
+	if (its_setup1()) {
+		migrate_skip();
 		return;
+	}
 
 	dev2 = its_get_device(2);
 	dev7 = its_get_device(7);
 
-	migrate_once();
+	migrate();
 
 	stats_reset();
 	cpumask_clear(&mask);
@@ -819,8 +821,10 @@ static void test_migrate_unmapped_collection(void)
 	int pe0 = 0;
 	u8 config;
 
-	if (its_setup1())
+	if (its_setup1()) {
+		migrate_skip();
 		return;
+	}
 
 	if (!errata(ERRATA_UNMAPPED_COLLECTIONS)) {
 		report_skip("Skipping test, as this test hangs without the fix. "
@@ -836,7 +840,7 @@ static void test_migrate_unmapped_collection(void)
 	its_send_mapti(dev2, 8192, 0, col);
 	gicv3_lpi_set_config(8192, LPI_PROP_DEFAULT);
 
-	migrate_once();
+	migrate();
 
 	/* on the destination, map the collection */
 	its_send_mapc(col, true);
@@ -875,8 +879,10 @@ static void test_its_pending_migration(void)
 	void *ptr;
 	int i;
 
-	if (its_prerequisites(4))
+	if (its_prerequisites(4)) {
+		migrate_skip();
 		return;
+	}
 
 	dev = its_create_device(2 /* dev id */, 8 /* nb_ites */);
 	its_send_mapd(dev, true);
@@ -923,7 +929,7 @@ static void test_its_pending_migration(void)
 	gicv3_lpi_rdist_enable(pe0);
 	gicv3_lpi_rdist_enable(pe1);
 
-	migrate_once();
+	migrate();
 
 	/* let's wait for the 256 LPIs to be handled */
 	mdelay(1000);
@@ -970,17 +976,14 @@ int main(int argc, char **argv)
 	} else if (!strcmp(argv[1], "its-migration")) {
 		report_prefix_push(argv[1]);
 		test_its_migration();
-		migrate_once();
 		report_prefix_pop();
 	} else if (!strcmp(argv[1], "its-pending-migration")) {
 		report_prefix_push(argv[1]);
 		test_its_pending_migration();
-		migrate_once();
 		report_prefix_pop();
 	} else if (!strcmp(argv[1], "its-migrate-unmapped-collection")) {
 		report_prefix_push(argv[1]);
 		test_migrate_unmapped_collection();
-		migrate_once();
 		report_prefix_pop();
 	} else if (strcmp(argv[1], "its-introspection") == 0) {
 		report_prefix_push(argv[1]);
diff --git a/lib/migrate.c b/lib/migrate.c
index 92d1d957d..dde43a90e 100644
--- a/lib/migrate.c
+++ b/lib/migrate.c
@@ -43,3 +43,13 @@ void migrate_once(void)
 	migrated = true;
 	migrate();
 }
+
+/*
+ * When the test has been started in migration mode, but the test case is
+ * skipped and no migration point is reached, this can be used to tell the
+ * harness not to mark it as a failure to migrate.
+ */
+void migrate_skip(void)
+{
+	puts("Skipped VM migration (quiet)\n");
+}
diff --git a/lib/migrate.h b/lib/migrate.h
index 95b9102b0..db6e0c501 100644
--- a/lib/migrate.h
+++ b/lib/migrate.h
@@ -9,3 +9,5 @@
 void migrate(void);
 void migrate_quiet(void);
 void migrate_once(void);
+
+void migrate_skip(void);
diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
index 2214d940c..3257d5218 100644
--- a/scripts/arch-run.bash
+++ b/scripts/arch-run.bash
@@ -152,7 +152,9 @@ run_migration ()
 		-chardev socket,id=mon,path=${src_qmp},server=on,wait=off \
 		-mon chardev=mon,mode=control > ${src_outfifo} &
 	live_pid=$!
-	cat ${src_outfifo} | tee ${src_out} | grep -v "Now migrate the VM (quiet)" &
+	cat ${src_outfifo} | tee ${src_out} | \
+		grep -v "Now migrate the VM (quiet)" | \
+		grep -v "Skipped VM migration (quiet)" &
 
 	# Start the first destination QEMU machine in advance of the test
 	# reaching the migration point, since we expect at least one migration.
@@ -190,16 +192,22 @@ do_migration ()
 		-mon chardev=mon,mode=control -incoming unix:${dst_incoming} \
 		< <(cat ${dst_infifo}) > ${dst_outfifo} &
 	incoming_pid=$!
-	cat ${dst_outfifo} | tee ${dst_out} | grep -v "Now migrate the VM (quiet)" &
+	cat ${dst_outfifo} | tee ${dst_out} | \
+		grep -v "Now migrate the VM (quiet)" | \
+		grep -v "Skipped VM migration (quiet)" &
 
 	# The test must prompt the user to migrate, so wait for the
 	# "Now migrate VM" console message.
 	while ! grep -q -i "Now migrate the VM" < ${src_out} ; do
 		if ! ps -p ${live_pid} > /dev/null ; then
-			echo "ERROR: Test exit before migration point." >&2
 			echo > ${dst_infifo}
-			qmp ${src_qmp} '"quit"'> ${src_qmpout} 2>/dev/null
 			qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null
+			if grep -q -i "Skipped VM migration" < ${src_out} ; then
+				wait ${live_pid}
+				return $?
+			fi
+			echo "ERROR: Test exit before migration point." >&2
+			qmp ${src_qmp} '"quit"'> ${src_qmpout} 2>/dev/null
 			return 3
 		fi
 		sleep 0.1
Alexandru Elisei Feb. 20, 2024, 10:22 a.m. UTC | #10
Hi,

On Tue, Feb 20, 2024 at 06:51:51PM +1000, Nicholas Piggin wrote:
> On Tue Feb 20, 2024 at 2:56 AM AEST, Alexandru Elisei wrote:
> > Hi,
> >
> > Thanks for writing this. I've tested it with kvmtool, which emulates a 8250
> > UART:
> >
> > Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
> >
> > This fixes a longstanding bug with kvmtool, where migrate_once() would read
> > the last character that was sent, and then think that migration was
> > completed even though it was never performed.
> >
> > While we are on the subject of migration:
> >
> > SKIP: gicv3: its-migration: Test requires at least 4 vcpus
> > Now migrate the VM, then press a key to continue...
> > INFO: gicv3: its-migration: Migration complete
> > SUMMARY: 1 tests, 1 skipped
> >
> > That's extremely confusing. Why is migrate_once() executed after the
> > test_its_pending() function call without checking if the test was skipped?
> 
> Seems not too hard, incremental patch on top of multi migration
> series below. After this series is merged I can try that (s390
> could benefit with some changes too).

Thank you for having a look at this so quickly. The changes to the gic test
look good to me.

As an alternative, have you considered modifying the test harness to parse
the SKIP message, and if the test is in the migration group to not mark it
as a migration failure? That would require that the test name printed by a
test matches the test name from unittests.cfg (should probably be the case
already), but any new migration tests will just work, without having to put
migrate_skip() at each failure point.

Thanks,
Alex

> 
> Thanks,
> Nick
> 
> ---
> diff --git a/arm/gic.c b/arm/gic.c
> index c950b0d15..bbf828f17 100644
> --- a/arm/gic.c
> +++ b/arm/gic.c
> @@ -782,13 +782,15 @@ static void test_its_migration(void)
>  	struct its_device *dev2, *dev7;
>  	cpumask_t mask;
>  
> -	if (its_setup1())
> +	if (its_setup1()) {
> +		migrate_skip();
>  		return;
> +	}
>  
>  	dev2 = its_get_device(2);
>  	dev7 = its_get_device(7);
>  
> -	migrate_once();
> +	migrate();
>  
>  	stats_reset();
>  	cpumask_clear(&mask);
> @@ -819,8 +821,10 @@ static void test_migrate_unmapped_collection(void)
>  	int pe0 = 0;
>  	u8 config;
>  
> -	if (its_setup1())
> +	if (its_setup1()) {
> +		migrate_skip();
>  		return;
> +	}
>  
>  	if (!errata(ERRATA_UNMAPPED_COLLECTIONS)) {
>  		report_skip("Skipping test, as this test hangs without the fix. "
> @@ -836,7 +840,7 @@ static void test_migrate_unmapped_collection(void)
>  	its_send_mapti(dev2, 8192, 0, col);
>  	gicv3_lpi_set_config(8192, LPI_PROP_DEFAULT);
>  
> -	migrate_once();
> +	migrate();
>  
>  	/* on the destination, map the collection */
>  	its_send_mapc(col, true);
> @@ -875,8 +879,10 @@ static void test_its_pending_migration(void)
>  	void *ptr;
>  	int i;
>  
> -	if (its_prerequisites(4))
> +	if (its_prerequisites(4)) {
> +		migrate_skip();
>  		return;
> +	}
>  
>  	dev = its_create_device(2 /* dev id */, 8 /* nb_ites */);
>  	its_send_mapd(dev, true);
> @@ -923,7 +929,7 @@ static void test_its_pending_migration(void)
>  	gicv3_lpi_rdist_enable(pe0);
>  	gicv3_lpi_rdist_enable(pe1);
>  
> -	migrate_once();
> +	migrate();
>  
>  	/* let's wait for the 256 LPIs to be handled */
>  	mdelay(1000);
> @@ -970,17 +976,14 @@ int main(int argc, char **argv)
>  	} else if (!strcmp(argv[1], "its-migration")) {
>  		report_prefix_push(argv[1]);
>  		test_its_migration();
> -		migrate_once();
>  		report_prefix_pop();
>  	} else if (!strcmp(argv[1], "its-pending-migration")) {
>  		report_prefix_push(argv[1]);
>  		test_its_pending_migration();
> -		migrate_once();
>  		report_prefix_pop();
>  	} else if (!strcmp(argv[1], "its-migrate-unmapped-collection")) {
>  		report_prefix_push(argv[1]);
>  		test_migrate_unmapped_collection();
> -		migrate_once();
>  		report_prefix_pop();
>  	} else if (strcmp(argv[1], "its-introspection") == 0) {
>  		report_prefix_push(argv[1]);
> diff --git a/lib/migrate.c b/lib/migrate.c
> index 92d1d957d..dde43a90e 100644
> --- a/lib/migrate.c
> +++ b/lib/migrate.c
> @@ -43,3 +43,13 @@ void migrate_once(void)
>  	migrated = true;
>  	migrate();
>  }
> +
> +/*
> + * When the test has been started in migration mode, but the test case is
> + * skipped and no migration point is reached, this can be used to tell the
> + * harness not to mark it as a failure to migrate.
> + */
> +void migrate_skip(void)
> +{
> +	puts("Skipped VM migration (quiet)\n");
> +}
> diff --git a/lib/migrate.h b/lib/migrate.h
> index 95b9102b0..db6e0c501 100644
> --- a/lib/migrate.h
> +++ b/lib/migrate.h
> @@ -9,3 +9,5 @@
>  void migrate(void);
>  void migrate_quiet(void);
>  void migrate_once(void);
> +
> +void migrate_skip(void);
> diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
> index 2214d940c..3257d5218 100644
> --- a/scripts/arch-run.bash
> +++ b/scripts/arch-run.bash
> @@ -152,7 +152,9 @@ run_migration ()
>  		-chardev socket,id=mon,path=${src_qmp},server=on,wait=off \
>  		-mon chardev=mon,mode=control > ${src_outfifo} &
>  	live_pid=$!
> -	cat ${src_outfifo} | tee ${src_out} | grep -v "Now migrate the VM (quiet)" &
> +	cat ${src_outfifo} | tee ${src_out} | \
> +		grep -v "Now migrate the VM (quiet)" | \
> +		grep -v "Skipped VM migration (quiet)" &
>  
>  	# Start the first destination QEMU machine in advance of the test
>  	# reaching the migration point, since we expect at least one migration.
> @@ -190,16 +192,22 @@ do_migration ()
>  		-mon chardev=mon,mode=control -incoming unix:${dst_incoming} \
>  		< <(cat ${dst_infifo}) > ${dst_outfifo} &
>  	incoming_pid=$!
> -	cat ${dst_outfifo} | tee ${dst_out} | grep -v "Now migrate the VM (quiet)" &
> +	cat ${dst_outfifo} | tee ${dst_out} | \
> +		grep -v "Now migrate the VM (quiet)" | \
> +		grep -v "Skipped VM migration (quiet)" &
>  
>  	# The test must prompt the user to migrate, so wait for the
>  	# "Now migrate VM" console message.
>  	while ! grep -q -i "Now migrate the VM" < ${src_out} ; do
>  		if ! ps -p ${live_pid} > /dev/null ; then
> -			echo "ERROR: Test exit before migration point." >&2
>  			echo > ${dst_infifo}
> -			qmp ${src_qmp} '"quit"'> ${src_qmpout} 2>/dev/null
>  			qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null
> +			if grep -q -i "Skipped VM migration" < ${src_out} ; then
> +				wait ${live_pid}
> +				return $?
> +			fi
> +			echo "ERROR: Test exit before migration point." >&2
> +			qmp ${src_qmp} '"quit"'> ${src_qmpout} 2>/dev/null
>  			return 3
>  		fi
>  		sleep 0.1
Nicholas Piggin Feb. 21, 2024, 3:39 a.m. UTC | #11
On Sat Feb 17, 2024 at 12:02 AM AEST, Thomas Huth wrote:
> getchar() can currently only be called once on arm since the implementation
> is a little bit too  naïve: After the first character has arrived, the
> data register never gets set to zero again. To properly check whether a
> byte is available, we need to check the "RX fifo empty" on the pl011 UART
> or the "RX data ready" bit on the ns16550a UART instead.
>
> With this proper check in place, we can finally also get rid of the
> ugly assert(count < 16) statement here.
>
> Signed-off-by: Thomas Huth <thuth@redhat.com>

With the QEMU bug fix, the multi migration patches work well on
arm64 with this patch for me (on TCG).

Tested-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick

> ---
>  lib/arm/io.c | 34 ++++++++++++++--------------------
>  1 file changed, 14 insertions(+), 20 deletions(-)
>
> diff --git a/lib/arm/io.c b/lib/arm/io.c
> index c15e57c4..836fa854 100644
> --- a/lib/arm/io.c
> +++ b/lib/arm/io.c
> @@ -28,6 +28,7 @@ static struct spinlock uart_lock;
>   */
>  #define UART_EARLY_BASE (u8 *)(unsigned long)CONFIG_UART_EARLY_BASE
>  static volatile u8 *uart0_base = UART_EARLY_BASE;
> +bool is_pl011_uart;
>  
>  static void uart0_init_fdt(void)
>  {
> @@ -59,7 +60,10 @@ static void uart0_init_fdt(void)
>  			abort();
>  		}
>  
> +		is_pl011_uart = (i == 0);
>  	} else {
> +		is_pl011_uart = !fdt_node_check_compatible(dt_fdt(), ret,
> +		                                           "arm,pl011");
>  		ret = dt_pbus_translate_node(ret, 0, &base);
>  		assert(ret == 0);
>  	}
> @@ -111,31 +115,21 @@ void puts(const char *s)
>  	spin_unlock(&uart_lock);
>  }
>  
> -static int do_getchar(void)
> +int __getchar(void)
>  {
> -	int c;
> +	int c = -1;
>  
>  	spin_lock(&uart_lock);
> -	c = readb(uart0_base);
> -	spin_unlock(&uart_lock);
> -
> -	return c ?: -1;
> -}
> -
> -/*
> - * Minimalist implementation for migration completion detection.
> - * Without FIFOs enabled on the QEMU UART device we just read
> - * the data register: we cannot read more than 16 characters.
> - */
> -int __getchar(void)
> -{
> -	int c = do_getchar();
> -	static int count;
>  
> -	if (c != -1)
> -		++count;
> +	if (is_pl011_uart) {
> +		if (!(readb(uart0_base + 6 * 4) & 0x10))  /* RX not empty? */
> +			c = readb(uart0_base);
> +	} else {
> +		if (readb(uart0_base + 5) & 0x01)         /* RX data ready? */
> +			c = readb(uart0_base);
> +	}
>  
> -	assert(count < 16);
> +	spin_unlock(&uart_lock);
>  
>  	return c;
>  }
Andrew Jones Feb. 21, 2024, 7:42 a.m. UTC | #12
On Fri, Feb 16, 2024 at 03:02:10PM +0100, Thomas Huth wrote:
> getchar() can currently only be called once on arm since the implementation
> is a little bit too  naïve: After the first character has arrived, the
> data register never gets set to zero again. To properly check whether a
> byte is available, we need to check the "RX fifo empty" on the pl011 UART
> or the "RX data ready" bit on the ns16550a UART instead.
> 
> With this proper check in place, we can finally also get rid of the
> ugly assert(count < 16) statement here.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  lib/arm/io.c | 34 ++++++++++++++--------------------
>  1 file changed, 14 insertions(+), 20 deletions(-)
>

Acked-by: Andrew Jones <andrew.jones@linux.dev>
diff mbox series

Patch

diff --git a/lib/arm/io.c b/lib/arm/io.c
index c15e57c4..836fa854 100644
--- a/lib/arm/io.c
+++ b/lib/arm/io.c
@@ -28,6 +28,7 @@  static struct spinlock uart_lock;
  */
 #define UART_EARLY_BASE (u8 *)(unsigned long)CONFIG_UART_EARLY_BASE
 static volatile u8 *uart0_base = UART_EARLY_BASE;
+bool is_pl011_uart;
 
 static void uart0_init_fdt(void)
 {
@@ -59,7 +60,10 @@  static void uart0_init_fdt(void)
 			abort();
 		}
 
+		is_pl011_uart = (i == 0);
 	} else {
+		is_pl011_uart = !fdt_node_check_compatible(dt_fdt(), ret,
+		                                           "arm,pl011");
 		ret = dt_pbus_translate_node(ret, 0, &base);
 		assert(ret == 0);
 	}
@@ -111,31 +115,21 @@  void puts(const char *s)
 	spin_unlock(&uart_lock);
 }
 
-static int do_getchar(void)
+int __getchar(void)
 {
-	int c;
+	int c = -1;
 
 	spin_lock(&uart_lock);
-	c = readb(uart0_base);
-	spin_unlock(&uart_lock);
-
-	return c ?: -1;
-}
-
-/*
- * Minimalist implementation for migration completion detection.
- * Without FIFOs enabled on the QEMU UART device we just read
- * the data register: we cannot read more than 16 characters.
- */
-int __getchar(void)
-{
-	int c = do_getchar();
-	static int count;
 
-	if (c != -1)
-		++count;
+	if (is_pl011_uart) {
+		if (!(readb(uart0_base + 6 * 4) & 0x10))  /* RX not empty? */
+			c = readb(uart0_base);
+	} else {
+		if (readb(uart0_base + 5) & 0x01)         /* RX data ready? */
+			c = readb(uart0_base);
+	}
 
-	assert(count < 16);
+	spin_unlock(&uart_lock);
 
 	return c;
 }