Message ID | 20220817105720.111618-1-leo.yan@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | xen/arm: acpi: Support memory reserve configuration table | expand |
On 17.08.2022 12:57, Leo Yan wrote: > On Arm64, when boot Dom0 with the Linux kernel, it reports the warning: > > [ 0.403737] ------------[ cut here ]------------ > [ 0.403738] WARNING: CPU: 30 PID: 0 at drivers/irqchip/irq-gic-v3-its.c:3074 its_cpu_init+0x814/0xae0 > [ 0.403745] Modules linked in: > [ 0.403748] CPU: 30 PID: 0 Comm: swapper/30 Tainted: G W 5.15.23-ampere-lts-standard #1 > [ 0.403752] pstate: 600001c5 (nZCv dAIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 0.403755] pc : its_cpu_init+0x814/0xae0 > [ 0.403758] lr : its_cpu_init+0x810/0xae0 > [ 0.403761] sp : ffff800009c03ce0 > [ 0.403762] x29: ffff800009c03ce0 x28: 000000000000001e x27: ffff880711f43000 > [ 0.403767] x26: ffff80000a3c0070 x25: fffffc1ffe0a4400 x24: ffff80000a3c0000 > [ 0.403770] x23: ffff8000095bc998 x22: ffff8000090a6000 x21: ffff800009850cb0 > [ 0.403774] x20: ffff800009701a10 x19: ffff800009701000 x18: ffffffffffffffff > [ 0.403777] x17: 3030303035303031 x16: 3030313030303078 x15: 303a30206e6f6967 > [ 0.403780] x14: 6572206530312072 x13: 3030303030353030 x12: 3130303130303030 > [ 0.403784] x11: 78303a30206e6f69 x10: 6765722065303120 x9 : ffff80000870e710 > [ 0.403788] x8 : 6964657220646e75 x7 : 0000000000000003 x6 : 0000000000000000 > [ 0.403791] x5 : 0000000000000000 x4 : fffffc0000000000 x3 : 0000000000000010 > [ 0.403794] x2 : 000000000000ffff x1 : 0000000000010000 x0 : 00000000ffffffed > [ 0.403798] Call trace: > [ 0.403799] its_cpu_init+0x814/0xae0 > [ 0.403802] gic_starting_cpu+0x48/0x90 > [ 0.403805] cpuhp_invoke_callback+0x16c/0x5b0 > [ 0.403808] cpuhp_invoke_callback_range+0x78/0xf0 > [ 0.403811] notify_cpu_starting+0xbc/0xdc > [ 0.403814] secondary_start_kernel+0xe0/0x170 > [ 0.403817] __secondary_switched+0x94/0x98 > [ 0.403821] ---[ end trace f68728a0d3053b70 ]--- > > In the Linux kernel, the GIC driver tries to reserve ITS interrupt > table, and the reserved pages can survive for kexec/kdump and will be > used for secondary kernel. Linux kernel relies on MEMRESERVE EFI > configuration table for memory page , but this table is not supported > by Xen. > > This patch adds a MEMRESERVE configuration table into the system table, > it points to a dummy data structure acpi_table_memreserve, this > structure has the consistent definition with the Linux kernel. > Following the method in Xen, it creates a table entry for the structure > acpi_table_memreserve. > > Cc: Ard Biesheuvel <ardb@kernel.org> > Cc: Marc Zyngier <maz@kernel.org> > Cc: Bertrand Marquis <Bertrand.Marquis@arm.com> > Cc: Rahul Singh <Rahul.Singh@arm.com> > Cc: Peter Griffin <peter.griffin@linaro.org> > Signed-off-by: Leo Yan <leo.yan@linaro.org> > --- > xen/arch/arm/acpi/domain_build.c | 24 ++++++++++++++++++++++++ > xen/arch/arm/efi/efi-dom0.c | 19 ++++++++++++++++--- > xen/arch/arm/include/asm/acpi.h | 1 + > xen/include/acpi/actbl.h | 17 +++++++++++++++++ > xen/include/efi/efiapi.h | 2 ++ > 5 files changed, 60 insertions(+), 3 deletions(-) Please make sure you Cc all maintainers of all files that you touch. Albeit, see below, you could indeed have avoided Cc-ing me if you hadn't misplaced stuff in two of the headers that you fiddle with. > --- a/xen/arch/arm/efi/efi-dom0.c > +++ b/xen/arch/arm/efi/efi-dom0.c > @@ -38,7 +38,7 @@ size_t __init estimate_efi_size(unsigned int mem_nr_banks) > { > size_t size; > size_t est_size = sizeof(EFI_SYSTEM_TABLE); > - size_t ect_size = sizeof(EFI_CONFIGURATION_TABLE); > + size_t ect_size = sizeof(EFI_CONFIGURATION_TABLE) * 2; > size_t emd_size = sizeof(EFI_MEMORY_DESCRIPTOR); > size_t fw_vendor_size = sizeof(xen_efi_fw_vendor); > unsigned int acpi_mem_nr_banks = 0; > @@ -63,7 +63,8 @@ void __init acpi_create_efi_system_table(struct domain *d, > > table_addr = d->arch.efi_acpi_gpa > + acpi_get_table_offset(tbl_add, TBL_EFIT); > - table_size = sizeof(EFI_SYSTEM_TABLE) + sizeof(EFI_CONFIGURATION_TABLE) > + table_size = sizeof(EFI_SYSTEM_TABLE) > + + sizeof(EFI_CONFIGURATION_TABLE) * 2 > + sizeof(xen_efi_fw_vendor); Nit: Indentation. > @@ -75,7 +76,7 @@ void __init acpi_create_efi_system_table(struct domain *d, > efi_sys_tbl->Hdr.HeaderSize = table_size; > > efi_sys_tbl->FirmwareRevision = 1; > - efi_sys_tbl->NumberOfTableEntries = 1; > + efi_sys_tbl->NumberOfTableEntries = 2; This is the 3rd magic "2" - I think there wants to be some consolidation, such that it becomes obvious which "2"-s really are the same (and would change together if, like you do here, another entry is needed). > @@ -86,6 +87,18 @@ void __init acpi_create_efi_system_table(struct domain *d, > efi_conf_tbl->VendorTable = (VOID *)tbl_add[TBL_RSDP].start; > efi_sys_tbl->ConfigurationTable = (EFI_CONFIGURATION_TABLE *)(table_addr > + offset); > + > + /* > + * Configuration table for MEMRESERVE is used in Linux kernel for > + * reserving pages, its main purpose is used for kexec/kdump to > + * reserve persistent pages (e.g. GIC pending pages) for the secondary > + * kernel. > + */ > + offset += sizeof(EFI_CONFIGURATION_TABLE); > + efi_conf_tbl = (EFI_CONFIGURATION_TABLE *)(base_ptr + offset); > + efi_conf_tbl->VendorGuid = (EFI_GUID)LINUX_EFI_MEMRESERVE_TABLE_GUID; > + efi_conf_tbl->VendorTable = (VOID *)tbl_add[TBL_MRSV].start; > + > xz_crc32_init(); > efi_sys_tbl->Hdr.CRC32 = xz_crc32((uint8_t *)efi_sys_tbl, > efi_sys_tbl->Hdr.HeaderSize, 0); Rather than adjusting offset and calculating efi_conf_table fdrom scratch, perhaps better simply efi_conf_table++? That way there would be one less cast, which are always somewhat risky. > --- a/xen/include/acpi/actbl.h > +++ b/xen/include/acpi/actbl.h > @@ -302,6 +302,23 @@ struct acpi_table_fadt { > #define ACPI_FADT_HW_REDUCED (1<<20) /* 20: [V5] ACPI hardware is not implemented (ACPI 5.0) */ > #define ACPI_FADT_LOW_POWER_S0 (1<<21) /* 21: [V5] S0 power savings are equal or better than S3 (ACPI 5.0) */ > > +/******************************************************************************* > + * > + * MEMRESERVE - Dummy entry for memory reserve configuration table > + * > + ******************************************************************************/ > + > +struct acpi_table_memreserve { > + int size; /* allocated size of the array */ > + int count; /* number of entries used */ > + u64 next; /* pa of next struct instance */ > + struct { > + u64 base; > + u64 size; > + } entry[]; > +}; This header holds ACPI spec defined data structures. This one looks to be a Linux one, and hence shouldn't be defined here. You use it in a single CU only, so I see no reason to define it there. Furthermore - what if Linux decided to change their structure? Or is there a guarantee that they won't? Generally such structures belong in the public interface, guaranteeing forward compatibility even if Linux decided to change / extend theirs (at which point consuming code there would need to do translation, but maybe using a Xen-defined struct [plus translation in Linux] right away would be best). Finally, style-wise, please don't use u64 in new code anymore; we are trying hard to move over to standard uint<N>_t types. Plus, unless indeed mandated by Linux, please avoid signed types for fields (or variables) which can never go negative. > + > + > /* Values for preferred_profile (Preferred Power Management Profiles) */ Please don't add double blank lines anywhere. > --- a/xen/include/efi/efiapi.h > +++ b/xen/include/efi/efiapi.h > @@ -882,6 +882,8 @@ typedef struct _EFI_BOOT_SERVICES { > #define SAL_SYSTEM_TABLE_GUID \ > { 0xeb9d2d32, 0x2d88, 0x11d3, {0x9a, 0x16, 0x0, 0x90, 0x27, 0x3f, 0xc1, 0x4d} } > > +#define LINUX_EFI_MEMRESERVE_TABLE_GUID \ > + { 0x888eb0c6, 0x8ede, 0x4ff5, {0xa8, 0xf0, 0x9a, 0xee, 0x5c, 0xb9, 0x77, 0xc2} } This header holds EFI spec mandated definitions (generally taken from the gnu-efi project), when this one again looks to be a Linux one (and again looks to be used in only a single CU). Jan
On Wed, 17 Aug 2022 at 15:18, Jan Beulich <jbeulich@suse.com> wrote: > > On 17.08.2022 12:57, Leo Yan wrote: > > On Arm64, when boot Dom0 with the Linux kernel, it reports the warning: > > > > [ 0.403737] ------------[ cut here ]------------ > > [ 0.403738] WARNING: CPU: 30 PID: 0 at drivers/irqchip/irq-gic-v3-its.c:3074 its_cpu_init+0x814/0xae0 > > [ 0.403745] Modules linked in: > > [ 0.403748] CPU: 30 PID: 0 Comm: swapper/30 Tainted: G W 5.15.23-ampere-lts-standard #1 > > [ 0.403752] pstate: 600001c5 (nZCv dAIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > [ 0.403755] pc : its_cpu_init+0x814/0xae0 > > [ 0.403758] lr : its_cpu_init+0x810/0xae0 > > [ 0.403761] sp : ffff800009c03ce0 > > [ 0.403762] x29: ffff800009c03ce0 x28: 000000000000001e x27: ffff880711f43000 > > [ 0.403767] x26: ffff80000a3c0070 x25: fffffc1ffe0a4400 x24: ffff80000a3c0000 > > [ 0.403770] x23: ffff8000095bc998 x22: ffff8000090a6000 x21: ffff800009850cb0 > > [ 0.403774] x20: ffff800009701a10 x19: ffff800009701000 x18: ffffffffffffffff > > [ 0.403777] x17: 3030303035303031 x16: 3030313030303078 x15: 303a30206e6f6967 > > [ 0.403780] x14: 6572206530312072 x13: 3030303030353030 x12: 3130303130303030 > > [ 0.403784] x11: 78303a30206e6f69 x10: 6765722065303120 x9 : ffff80000870e710 > > [ 0.403788] x8 : 6964657220646e75 x7 : 0000000000000003 x6 : 0000000000000000 > > [ 0.403791] x5 : 0000000000000000 x4 : fffffc0000000000 x3 : 0000000000000010 > > [ 0.403794] x2 : 000000000000ffff x1 : 0000000000010000 x0 : 00000000ffffffed > > [ 0.403798] Call trace: > > [ 0.403799] its_cpu_init+0x814/0xae0 > > [ 0.403802] gic_starting_cpu+0x48/0x90 > > [ 0.403805] cpuhp_invoke_callback+0x16c/0x5b0 > > [ 0.403808] cpuhp_invoke_callback_range+0x78/0xf0 > > [ 0.403811] notify_cpu_starting+0xbc/0xdc > > [ 0.403814] secondary_start_kernel+0xe0/0x170 > > [ 0.403817] __secondary_switched+0x94/0x98 > > [ 0.403821] ---[ end trace f68728a0d3053b70 ]--- > > > > In the Linux kernel, the GIC driver tries to reserve ITS interrupt > > table, and the reserved pages can survive for kexec/kdump and will be > > used for secondary kernel. Linux kernel relies on MEMRESERVE EFI > > configuration table for memory page , but this table is not supported > > by Xen. > > > > This patch adds a MEMRESERVE configuration table into the system table, > > it points to a dummy data structure acpi_table_memreserve, this > > structure has the consistent definition with the Linux kernel. > > Following the method in Xen, it creates a table entry for the structure > > acpi_table_memreserve. > > > > Cc: Ard Biesheuvel <ardb@kernel.org> > > Cc: Marc Zyngier <maz@kernel.org> > > Cc: Bertrand Marquis <Bertrand.Marquis@arm.com> > > Cc: Rahul Singh <Rahul.Singh@arm.com> > > Cc: Peter Griffin <peter.griffin@linaro.org> > > Signed-off-by: Leo Yan <leo.yan@linaro.org> > > --- > > xen/arch/arm/acpi/domain_build.c | 24 ++++++++++++++++++++++++ > > xen/arch/arm/efi/efi-dom0.c | 19 ++++++++++++++++--- > > xen/arch/arm/include/asm/acpi.h | 1 + > > xen/include/acpi/actbl.h | 17 +++++++++++++++++ > > xen/include/efi/efiapi.h | 2 ++ > > 5 files changed, 60 insertions(+), 3 deletions(-) > ... > > --- a/xen/include/acpi/actbl.h > > +++ b/xen/include/acpi/actbl.h > > @@ -302,6 +302,23 @@ struct acpi_table_fadt { > > #define ACPI_FADT_HW_REDUCED (1<<20) /* 20: [V5] ACPI hardware is not implemented (ACPI 5.0) */ > > #define ACPI_FADT_LOW_POWER_S0 (1<<21) /* 21: [V5] S0 power savings are equal or better than S3 (ACPI 5.0) */ > > > > +/******************************************************************************* > > + * > > + * MEMRESERVE - Dummy entry for memory reserve configuration table > > + * > > + ******************************************************************************/ > > + > > +struct acpi_table_memreserve { > > + int size; /* allocated size of the array */ > > + int count; /* number of entries used */ > > + u64 next; /* pa of next struct instance */ > > + struct { > > + u64 base; > > + u64 size; > > + } entry[]; > > +}; > > This header holds ACPI spec defined data structures. This one looks > to be a Linux one, and hence shouldn't be defined here. You use it > in a single CU only, so I see no reason to define it there. > > Furthermore - what if Linux decided to change their structure? Or > is there a guarantee that they won't? No, there is not. The memreserve table is an internal interface between the EFI stub loader and the Linux kernel proper. As I have argued many times before, booting the arm64 kernel in EFI/ACPI mode without going through the EFI stub violates the ACPI spec, and relies on interfaces that were not intended for public consumption. > Generally such structures > belong in the public interface, guaranteeing forward compatibility > even if Linux decided to change / extend theirs (at which point > consuming code there would need to do translation, but maybe using > a Xen-defined struct [plus translation in Linux] right away would > be best). > > > --- a/xen/include/efi/efiapi.h > > +++ b/xen/include/efi/efiapi.h > > @@ -882,6 +882,8 @@ typedef struct _EFI_BOOT_SERVICES { > > #define SAL_SYSTEM_TABLE_GUID \ > > { 0xeb9d2d32, 0x2d88, 0x11d3, {0x9a, 0x16, 0x0, 0x90, 0x27, 0x3f, 0xc1, 0x4d} } > > > > +#define LINUX_EFI_MEMRESERVE_TABLE_GUID \ > > + { 0x888eb0c6, 0x8ede, 0x4ff5, {0xa8, 0xf0, 0x9a, 0xee, 0x5c, 0xb9, 0x77, 0xc2} } > > This header holds EFI spec mandated definitions (generally taken > from the gnu-efi project), when this one again looks to be a Linux > one (and again looks to be used in only a single CU). > Same point as above.
On Wed, Aug 17, 2022 at 03:17:53PM +0200, Jan Beulich wrote: [...] > Please make sure you Cc all maintainers of all files that you touch. > Albeit, see below, you could indeed have avoided Cc-ing me if you > hadn't misplaced stuff in two of the headers that you fiddle with. Sorry for this. When I send a new patch in next time, I will use ./scripts/get_maintainer.pl to find out the maintainers. > > --- a/xen/arch/arm/efi/efi-dom0.c > > +++ b/xen/arch/arm/efi/efi-dom0.c > > @@ -38,7 +38,7 @@ size_t __init estimate_efi_size(unsigned int mem_nr_banks) > > { > > size_t size; > > size_t est_size = sizeof(EFI_SYSTEM_TABLE); > > - size_t ect_size = sizeof(EFI_CONFIGURATION_TABLE); > > + size_t ect_size = sizeof(EFI_CONFIGURATION_TABLE) * 2; > > size_t emd_size = sizeof(EFI_MEMORY_DESCRIPTOR); > > size_t fw_vendor_size = sizeof(xen_efi_fw_vendor); > > unsigned int acpi_mem_nr_banks = 0; > > @@ -63,7 +63,8 @@ void __init acpi_create_efi_system_table(struct domain *d, > > > > table_addr = d->arch.efi_acpi_gpa > > + acpi_get_table_offset(tbl_add, TBL_EFIT); > > - table_size = sizeof(EFI_SYSTEM_TABLE) + sizeof(EFI_CONFIGURATION_TABLE) > > + table_size = sizeof(EFI_SYSTEM_TABLE) > > + + sizeof(EFI_CONFIGURATION_TABLE) * 2 > > + sizeof(xen_efi_fw_vendor); > > Nit: Indentation. Will fix. > > @@ -75,7 +76,7 @@ void __init acpi_create_efi_system_table(struct domain *d, > > efi_sys_tbl->Hdr.HeaderSize = table_size; > > > > efi_sys_tbl->FirmwareRevision = 1; > > - efi_sys_tbl->NumberOfTableEntries = 1; > > + efi_sys_tbl->NumberOfTableEntries = 2; > > This is the 3rd magic "2" - I think there wants to be some consolidation, > such that it becomes obvious which "2"-s really are the same (and would > change together if, like you do here, another entry is needed). I will define a macro for the number of configuration table and add comment for it. > > @@ -86,6 +87,18 @@ void __init acpi_create_efi_system_table(struct domain *d, > > efi_conf_tbl->VendorTable = (VOID *)tbl_add[TBL_RSDP].start; > > efi_sys_tbl->ConfigurationTable = (EFI_CONFIGURATION_TABLE *)(table_addr > > + offset); > > + > > + /* > > + * Configuration table for MEMRESERVE is used in Linux kernel for > > + * reserving pages, its main purpose is used for kexec/kdump to > > + * reserve persistent pages (e.g. GIC pending pages) for the secondary > > + * kernel. > > + */ > > + offset += sizeof(EFI_CONFIGURATION_TABLE); > > + efi_conf_tbl = (EFI_CONFIGURATION_TABLE *)(base_ptr + offset); > > + efi_conf_tbl->VendorGuid = (EFI_GUID)LINUX_EFI_MEMRESERVE_TABLE_GUID; > > + efi_conf_tbl->VendorTable = (VOID *)tbl_add[TBL_MRSV].start; > > + > > xz_crc32_init(); > > efi_sys_tbl->Hdr.CRC32 = xz_crc32((uint8_t *)efi_sys_tbl, > > efi_sys_tbl->Hdr.HeaderSize, 0); > > Rather than adjusting offset and calculating efi_conf_table fdrom scratch, > perhaps better simply efi_conf_table++? That way there would be one less > cast, which are always somewhat risky. Yeah, using "efi_conf_table++" is much better. Will do. > > --- a/xen/include/acpi/actbl.h > > +++ b/xen/include/acpi/actbl.h > > @@ -302,6 +302,23 @@ struct acpi_table_fadt { > > #define ACPI_FADT_HW_REDUCED (1<<20) /* 20: [V5] ACPI hardware is not implemented (ACPI 5.0) */ > > #define ACPI_FADT_LOW_POWER_S0 (1<<21) /* 21: [V5] S0 power savings are equal or better than S3 (ACPI 5.0) */ > > > > +/******************************************************************************* > > + * > > + * MEMRESERVE - Dummy entry for memory reserve configuration table > > + * > > + ******************************************************************************/ > > + > > +struct acpi_table_memreserve { > > + int size; /* allocated size of the array */ > > + int count; /* number of entries used */ > > + u64 next; /* pa of next struct instance */ > > + struct { > > + u64 base; > > + u64 size; > > + } entry[]; > > +}; > > This header holds ACPI spec defined data structures. This one looks > to be a Linux one, and hence shouldn't be defined here. You use it > in a single CU only, so I see no reason to define it there. Okay, I will define the structure in the arm specific file, e.g. I move it to the file xen/arch/arm/acpi/domain_build.c. > Furthermore - what if Linux decided to change their structure? Or > is there a guarantee that they won't? Generally such structures > belong in the public interface, guaranteeing forward compatibility > even if Linux decided to change / extend theirs (at which point > consuming code there would need to do translation, but maybe using > a Xen-defined struct [plus translation in Linux] right away would > be best). I saw Ard has helped to answer this question in his email. As Ard said, the general way is to rely on Linux EFI stub to allocate the data structure for MEMRESERVE configuration table. Given Xen uses pseudo EFI booting (the ACPI table is passed via DT), in short term I don't think Xen can support Linux EFI stub, so we need to maintain this structure in Xen as well. This structure eventually will not change frequently, so I assume later we will have little effort for maintainence it. > Finally, style-wise, please don't use u64 in new code anymore; we > are trying hard to move over to standard uint<N>_t types. Plus, > unless indeed mandated by Linux, please avoid signed types for > fields (or variables) which can never go negative. Sure, will follow this and update the data structure. > > + > > + > > /* Values for preferred_profile (Preferred Power Management Profiles) */ > > Please don't add double blank lines anywhere. Will do. > > --- a/xen/include/efi/efiapi.h > > +++ b/xen/include/efi/efiapi.h > > @@ -882,6 +882,8 @@ typedef struct _EFI_BOOT_SERVICES { > > #define SAL_SYSTEM_TABLE_GUID \ > > { 0xeb9d2d32, 0x2d88, 0x11d3, {0x9a, 0x16, 0x0, 0x90, 0x27, 0x3f, 0xc1, 0x4d} } > > > > +#define LINUX_EFI_MEMRESERVE_TABLE_GUID \ > > + { 0x888eb0c6, 0x8ede, 0x4ff5, {0xa8, 0xf0, 0x9a, 0xee, 0x5c, 0xb9, 0x77, 0xc2} } > > This header holds EFI spec mandated definitions (generally taken > from the gnu-efi project), when this one again looks to be a Linux > one (and again looks to be used in only a single CU). I will move this macro into the file xen/arch/arm/efi/efi-dom0.c, so far only Arm64 platform uses it. Thanks, Leo
On 18.08.2022 09:34, Leo Yan wrote: > On Wed, Aug 17, 2022 at 03:17:53PM +0200, Jan Beulich wrote: >> Furthermore - what if Linux decided to change their structure? Or >> is there a guarantee that they won't? Generally such structures >> belong in the public interface, guaranteeing forward compatibility >> even if Linux decided to change / extend theirs (at which point >> consuming code there would need to do translation, but maybe using >> a Xen-defined struct [plus translation in Linux] right away would >> be best). > > I saw Ard has helped to answer this question in his email. As Ard > said, the general way is to rely on Linux EFI stub to allocate the > data structure for MEMRESERVE configuration table. > > Given Xen uses pseudo EFI booting (the ACPI table is passed via DT), in > short term I don't think Xen can support Linux EFI stub, so we need to > maintain this structure in Xen as well. > > This structure eventually will not change frequently, so I assume > later we will have little effort for maintainence it. "Will not change frequently" isn't enough. Imo there needs to be a public interface structure in Xen and translation code in Linux. That's the only way the consuming code in Linux can remain (largely) independent of changes to the structure in Linux (i.e. changes there can be expected to be accompanied by updating of this code, perhaps simply in order to keep things building). Jan
Hi Leo, On 18/08/2022 08:34, Leo Yan wrote: > On Wed, Aug 17, 2022 at 03:17:53PM +0200, Jan Beulich wrote: >> Furthermore - what if Linux decided to change their structure? Or >> is there a guarantee that they won't? Generally such structures >> belong in the public interface, guaranteeing forward compatibility >> even if Linux decided to change / extend theirs (at which point >> consuming code there would need to do translation, but maybe using >> a Xen-defined struct [plus translation in Linux] right away would >> be best). > > I saw Ard has helped to answer this question in his email. As Ard > said, the general way is to rely on Linux EFI stub to allocate the > data structure for MEMRESERVE configuration table. > > Given Xen uses pseudo EFI booting (the ACPI table is passed via DT), in > short term I don't think Xen can support Linux EFI stub, I agree that using Linux EFI stub will require more effort. I would also need to go through the mailing list archives (or maybe Stefano remember?) to understand why we decided against using the EFI stub. > so we need to > maintain this structure in Xen as well. I have looked at the commit message. IIUC, if this table is not created then Kexec will not work. Is there anything else that would not work? IOW, would Linux be unusable if we don't create MEMRESERVE? > > This structure eventually will not change frequently, so I assume > later we will have little effort for maintainence it The problem is not really "maintainance" here. It is more that if Linux is updating the structure in a non-backward compatible way, then new version of Linux would not boot on older Xen. We would have similar probable with new Xen booting older Xen because the hypervisor doesn't know (and should not need to know) which OS is used. In the nutshells, Xen should only use stable interface. If MEMRESERVE is really necessary then we should either switch to the Linux EFI stub or standardize MEMRESERVE. Cheers,
Hi Julien, On Thu, Aug 18, 2022 at 08:57:20AM +0100, Julien Grall wrote: > Hi Leo, > > On 18/08/2022 08:34, Leo Yan wrote: > > On Wed, Aug 17, 2022 at 03:17:53PM +0200, Jan Beulich wrote: > > > Furthermore - what if Linux decided to change their structure? Or > > > is there a guarantee that they won't? Generally such structures > > > belong in the public interface, guaranteeing forward compatibility > > > even if Linux decided to change / extend theirs (at which point > > > consuming code there would need to do translation, but maybe using > > > a Xen-defined struct [plus translation in Linux] right away would > > > be best). > > > > I saw Ard has helped to answer this question in his email. As Ard > > said, the general way is to rely on Linux EFI stub to allocate the > > data structure for MEMRESERVE configuration table. > > > > Given Xen uses pseudo EFI booting (the ACPI table is passed via DT), in > > short term I don't think Xen can support Linux EFI stub, > > I agree that using Linux EFI stub will require more effort. I would also > need to go through the mailing list archives (or maybe Stefano remember?) to > understand why we decided against using the EFI stub. Yeah, seems to me using the EFI stub is the right thing to do. I read Xen code and found Xen doesn't provide boot time callback, this is the main reason blocked me to move forward to that way. > > so we need to > > maintain this structure in Xen as well. > > I have looked at the commit message. IIUC, if this table is not created then > Kexec will not work. Is there anything else that would not work? No, AFAIK, kexec/kdump is the only customer to use the persistent reserved memory in the Linux kernel. I personally think this will pontentially impact other kernel debugging modules, like ramoops, it also needs to use persistent reserved memory when rebooting the kernel. But now kernel code doesn't add ramoops memory region into EFI MEMRESERVE table. > IOW, would Linux be unusable if we don't create MEMRESERVE? If we don't create EFI MEMRESERVE, kernel still can boot up successfully, though it cannot add pages into the reserved memory table and reports oops. Without EFI MEMRESERVE, the reserved memory table cannot be passed from the primary kernel to the secondary kernel, this is why it's important for debugging tools. > > This structure eventually will not change frequently, so I assume > > later we will have little effort for maintainence it > > The problem is not really "maintainance" here. It is more that if Linux is > updating the structure in a non-backward compatible way, then new > version of Linux would not boot on older Xen. > > We would have similar probable with new Xen booting older Xen because the > hypervisor doesn't know (and should not need to know) which OS is used. Fair point. > In the nutshells, Xen should only use stable interface. If MEMRESERVE is > really necessary then we should either switch to the Linux EFI stub or > standardize MEMRESERVE. Yeah, here I really need your (and other maintainers) opinions. Seems to me, support Linux EFI stub would be the best thing to do, to be honest, this is a task which is out of my capability :) Current approach in this patch is not perfect, it lets things to work. So please let me know the conclusion from your side. Thanks, Leo
Hi Jan, On Thu, Aug 18, 2022 at 09:47:46AM +0200, Jan Beulich wrote: > On 18.08.2022 09:34, Leo Yan wrote: > > On Wed, Aug 17, 2022 at 03:17:53PM +0200, Jan Beulich wrote: > >> Furthermore - what if Linux decided to change their structure? Or > >> is there a guarantee that they won't? Generally such structures > >> belong in the public interface, guaranteeing forward compatibility > >> even if Linux decided to change / extend theirs (at which point > >> consuming code there would need to do translation, but maybe using > >> a Xen-defined struct [plus translation in Linux] right away would > >> be best). > > > > I saw Ard has helped to answer this question in his email. As Ard > > said, the general way is to rely on Linux EFI stub to allocate the > > data structure for MEMRESERVE configuration table. > > > > Given Xen uses pseudo EFI booting (the ACPI table is passed via DT), in > > short term I don't think Xen can support Linux EFI stub, so we need to > > maintain this structure in Xen as well. > > > > This structure eventually will not change frequently, so I assume > > later we will have little effort for maintainence it. > > "Will not change frequently" isn't enough. Imo there needs to be a > public interface structure in Xen and translation code in Linux. > That's the only way the consuming code in Linux can remain (largely) > independent of changes to the structure in Linux (i.e. changes there > can be expected to be accompanied by updating of this code, perhaps > simply in order to keep things building). Yes, actually Xen doesn't care about the structure definition for linux_efi_memreserve, it just allocates the table and finally used by Linux kernel. So another way is we can simply allocate a bigger memory region (e.g. 64 bytes), which is sufficient than kernel's structure linux_efi_memreserve size (only 16 bytes), then we can initilize it as all zeros, and this can be helpful if later kernel extend the data structure. But this method is a bit arbitrary, this is why I did't write like this. As Julien said, I think the critical thing is to make a call to support EFI stub or not. If rollback to use current method, then I am happy to refine the patch with above idea. Thanks, Leo
On 18.08.2022 10:46, Leo Yan wrote: > On Thu, Aug 18, 2022 at 09:47:46AM +0200, Jan Beulich wrote: >> On 18.08.2022 09:34, Leo Yan wrote: >>> On Wed, Aug 17, 2022 at 03:17:53PM +0200, Jan Beulich wrote: >>>> Furthermore - what if Linux decided to change their structure? Or >>>> is there a guarantee that they won't? Generally such structures >>>> belong in the public interface, guaranteeing forward compatibility >>>> even if Linux decided to change / extend theirs (at which point >>>> consuming code there would need to do translation, but maybe using >>>> a Xen-defined struct [plus translation in Linux] right away would >>>> be best). >>> >>> I saw Ard has helped to answer this question in his email. As Ard >>> said, the general way is to rely on Linux EFI stub to allocate the >>> data structure for MEMRESERVE configuration table. >>> >>> Given Xen uses pseudo EFI booting (the ACPI table is passed via DT), in >>> short term I don't think Xen can support Linux EFI stub, so we need to >>> maintain this structure in Xen as well. >>> >>> This structure eventually will not change frequently, so I assume >>> later we will have little effort for maintainence it. >> >> "Will not change frequently" isn't enough. Imo there needs to be a >> public interface structure in Xen and translation code in Linux. >> That's the only way the consuming code in Linux can remain (largely) >> independent of changes to the structure in Linux (i.e. changes there >> can be expected to be accompanied by updating of this code, perhaps >> simply in order to keep things building). > > Yes, actually Xen doesn't care about the structure definition for > linux_efi_memreserve, it just allocates the table and finally used by > Linux kernel. So another way is we can simply allocate a bigger > memory region (e.g. 64 bytes), which is sufficient than kernel's > structure linux_efi_memreserve size (only 16 bytes), then we can > initilize it as all zeros, and this can be helpful if later kernel > extend the data structure. Well, no, that's not how one would define a structure which can, in a backwards compatible manner, be extended down the road. > But this method is a bit arbitrary, this is why I did't write like this. > As Julien said, I think the critical thing is to make a call to support > EFI stub or not. I don't see how this can be a reasonable option - we'd have to re- implement an almost complete EFI (to cover everything one can do via boot or runtime services). But I'm not an Arm maintainer, so my view here is at best advisory. Jan
Hi Ard, On Wed, Aug 17, 2022 at 03:49:32PM +0200, Ard Biesheuvel wrote: [...] > > This header holds ACPI spec defined data structures. This one looks > > to be a Linux one, and hence shouldn't be defined here. You use it > > in a single CU only, so I see no reason to define it there. > > > > Furthermore - what if Linux decided to change their structure? Or > > is there a guarantee that they won't? > > No, there is not. The memreserve table is an internal interface > between the EFI stub loader and the Linux kernel proper. > > As I have argued many times before, booting the arm64 kernel in > EFI/ACPI mode without going through the EFI stub violates the ACPI > spec, and relies on interfaces that were not intended for public > consumption. Let me ask a question but sorry it might be off topic. In the Linux kernel function: static int gic_reserve_range(phys_addr_t addr, unsigned long size) { if (efi_enabled(EFI_CONFIG_TABLES)) return efi_mem_reserve_persistent(addr, size); return 0; } We can see it will reserve persistent memory region for GIC pending pages, so my question is if a platform is booting with DT binding rather than ACPI table, how the primary kernel can reserve the pages and pass the info to the secondary kernel? Seems it's broken for kdump/kexec if kernel boots with using DT? Thanks, Leo
On Thu, 18 Aug 2022 at 11:15, Leo Yan <leo.yan@linaro.org> wrote: > > Hi Ard, > > On Wed, Aug 17, 2022 at 03:49:32PM +0200, Ard Biesheuvel wrote: > > [...] > > > > This header holds ACPI spec defined data structures. This one looks > > > to be a Linux one, and hence shouldn't be defined here. You use it > > > in a single CU only, so I see no reason to define it there. > > > > > > Furthermore - what if Linux decided to change their structure? Or > > > is there a guarantee that they won't? > > > > No, there is not. The memreserve table is an internal interface > > between the EFI stub loader and the Linux kernel proper. > > > > As I have argued many times before, booting the arm64 kernel in > > EFI/ACPI mode without going through the EFI stub violates the ACPI > > spec, and relies on interfaces that were not intended for public > > consumption. > > Let me ask a question but sorry it might be off topic. > > In the Linux kernel function: > > static int gic_reserve_range(phys_addr_t addr, unsigned long size) > { > if (efi_enabled(EFI_CONFIG_TABLES)) > return efi_mem_reserve_persistent(addr, size); > > return 0; > } > > We can see it will reserve persistent memory region for GIC pending > pages, so my question is if a platform is booting with DT binding > rather than ACPI table, how the primary kernel can reserve the pages > and pass the info to the secondary kernel? > > Seems it's broken for kdump/kexec if kernel boots with using DT? > EFI supports both DT and ACPI boot, but only ACPI *requires* EFI. So DT boot on hardware with affected GICv3 implementations works fine with kdump/kexec as long as EFI is being used. Using non-EFI boot and kexec on such systems will likely result in a splat on the second kernel, unless there is another mechanism being used to reserve the memory in DT as well. Maybe we should wire up the EFI_PARAVIRT flag for Xen dom0 on arm64, so that we can filter out the call above. That would mean that Xen/arm64/dom0/kexec on affected hardware would result in a splat in the second kernel as well, but whether that matters depends on your support scope.
On Thu, 18 Aug 2022 10:15:30 +0100, Leo Yan <leo.yan@linaro.org> wrote: > > Hi Ard, > > On Wed, Aug 17, 2022 at 03:49:32PM +0200, Ard Biesheuvel wrote: > > [...] > > > > This header holds ACPI spec defined data structures. This one looks > > > to be a Linux one, and hence shouldn't be defined here. You use it > > > in a single CU only, so I see no reason to define it there. > > > > > > Furthermore - what if Linux decided to change their structure? Or > > > is there a guarantee that they won't? > > > > No, there is not. The memreserve table is an internal interface > > between the EFI stub loader and the Linux kernel proper. > > > > As I have argued many times before, booting the arm64 kernel in > > EFI/ACPI mode without going through the EFI stub violates the ACPI > > spec, and relies on interfaces that were not intended for public > > consumption. > > Let me ask a question but sorry it might be off topic. > > In the Linux kernel function: > > static int gic_reserve_range(phys_addr_t addr, unsigned long size) > { > if (efi_enabled(EFI_CONFIG_TABLES)) > return efi_mem_reserve_persistent(addr, size); > > return 0; > } > > We can see it will reserve persistent memory region for GIC pending > pages, so my question is if a platform is booting with DT binding > rather than ACPI table, how the primary kernel can reserve the pages > and pass the info to the secondary kernel? This is a false dichotomy. DT and UEFI are not exclusive, far from it. That's actually how most non-ACPI systems boot these days, and they are able to use kexec out of the box, using the EFI MEMRESERVE internal API. The real difference is between UEFI and non-UEFI. If you're allergic to it, you have two options: - you delegate the redistributor configuration to your bootloader, mark the corresponding memory as reserved in the DT from the bootloader, and you're done. A bunch of embedded systems do that, and are able to kexec. - you keep configuring the RDs from Linux, but you must then mark the regions as reserved in the DT that is passed to the secondary kernel. It requires some cooperation from the kexec userspace to parse /proc/iomem and insert the correct annotations in that DT. > Seems it's broken for kdump/kexec if kernel boots with using DT? You equate kexec to kdump, which is wrong. kdump does *not* reuse the memory from the previous kernel, and thus is immune to the RD table reallocation problem. You don't need anything for kdump, as you will use the RD tables as configured by the initial kernel. M.
Hi Ard, On 18/08/2022 10:33, Ard Biesheuvel wrote: > On Thu, 18 Aug 2022 at 11:15, Leo Yan <leo.yan@linaro.org> wrote: >> >> Hi Ard, >> >> On Wed, Aug 17, 2022 at 03:49:32PM +0200, Ard Biesheuvel wrote: >> >> [...] >> >>>> This header holds ACPI spec defined data structures. This one looks >>>> to be a Linux one, and hence shouldn't be defined here. You use it >>>> in a single CU only, so I see no reason to define it there. >>>> >>>> Furthermore - what if Linux decided to change their structure? Or >>>> is there a guarantee that they won't? >>> >>> No, there is not. The memreserve table is an internal interface >>> between the EFI stub loader and the Linux kernel proper. >>> >>> As I have argued many times before, booting the arm64 kernel in >>> EFI/ACPI mode without going through the EFI stub violates the ACPI >>> spec, and relies on interfaces that were not intended for public >>> consumption. >> >> Let me ask a question but sorry it might be off topic. >> >> In the Linux kernel function: >> >> static int gic_reserve_range(phys_addr_t addr, unsigned long size) >> { >> if (efi_enabled(EFI_CONFIG_TABLES)) >> return efi_mem_reserve_persistent(addr, size); >> >> return 0; >> } >> >> We can see it will reserve persistent memory region for GIC pending >> pages, so my question is if a platform is booting with DT binding >> rather than ACPI table, how the primary kernel can reserve the pages >> and pass the info to the secondary kernel? >> >> Seems it's broken for kdump/kexec if kernel boots with using DT? >> > > EFI supports both DT and ACPI boot, but only ACPI *requires* EFI. > > So DT boot on hardware with affected GICv3 implementations works fine > with kdump/kexec as long as EFI is being used. Using non-EFI boot and > kexec on such systems will likely result in a splat on the second > kernel, unless there is another mechanism being used to reserve the > memory in DT as well. > > Maybe we should wire up the EFI_PARAVIRT flag for Xen dom0 on arm64, > so that we can filter out the call above. That would mean that > Xen/arm64/dom0/kexec on affected hardware would result in a splat in > the second kernel as well, but whether that matters depends on your > support scope. In the context of Xen, dom0 doesn't have direct access to the host ITS because we are emulating it. So I think it doesn't matter for us because we can fix our implementation if it is affected. That said, kexec-ing dom0 (or any other domain) on Xen on Arm would require some work to be supported. OOI, @leo is it something you are investigating? Cheers,
Hi Julien, > On 18 Aug 2022, at 08:57, Julien Grall <julien@xen.org> wrote: > > Hi Leo, > > On 18/08/2022 08:34, Leo Yan wrote: >> On Wed, Aug 17, 2022 at 03:17:53PM +0200, Jan Beulich wrote: >>> Furthermore - what if Linux decided to change their structure? Or >>> is there a guarantee that they won't? Generally such structures >>> belong in the public interface, guaranteeing forward compatibility >>> even if Linux decided to change / extend theirs (at which point >>> consuming code there would need to do translation, but maybe using >>> a Xen-defined struct [plus translation in Linux] right away would >>> be best). >> I saw Ard has helped to answer this question in his email. As Ard >> said, the general way is to rely on Linux EFI stub to allocate the >> data structure for MEMRESERVE configuration table. >> Given Xen uses pseudo EFI booting (the ACPI table is passed via DT), in >> short term I don't think Xen can support Linux EFI stub, > > I agree that using Linux EFI stub will require more effort. I would also need to go through the mailing list archives (or maybe Stefano remember?) to understand why we decided against using the EFI stub. I think the problem was that EFI also includes some functions and to have a proper EFI stub we would need to support those (console, mapping, disk access). Maybe this is something we could discuss but that would require to have some kind of binary providing those services that we put in EL1 or at least a stub calling Xen for those functions. One other solution of course is to run uboot and boot linux from this. > >> so we need to >> maintain this structure in Xen as well. > > I have looked at the commit message. IIUC, if this table is not created then Kexec will not work. Is there anything else that would not work? > > IOW, would Linux be unusable if we don't create MEMRESERVE? No it works perfectly fine. There are some warnings and as pointed out later kexec is not useable (but is not supported anyway). > >> This structure eventually will not change frequently, so I assume >> later we will have little effort for maintainence it > The problem is not really "maintainance" here. It is more that if Linux is updating the structure in a non-backward compatible way, then new > version of Linux would not boot on older Xen. > > We would have similar probable with new Xen booting older Xen because the hypervisor doesn't know (and should not need to know) which OS is used. > > In the nutshells, Xen should only use stable interface. If MEMRESERVE is really necessary then we should either switch to the Linux EFI stub or standardize MEMRESERVE. As we do support kexec on arm right now I think the current status should be kept. Those questions would need answering if we support kexec one day. Cheers Bertrand > > Cheers, > > -- > Julien Grall
On Thu, Aug 18, 2022 at 11:04:48AM +0100, Julien Grall wrote: [...] > > > Seems it's broken for kdump/kexec if kernel boots with using DT? > > > > > > > EFI supports both DT and ACPI boot, but only ACPI *requires* EFI. > > > > So DT boot on hardware with affected GICv3 implementations works fine > > with kdump/kexec as long as EFI is being used. Using non-EFI boot and > > kexec on such systems will likely result in a splat on the second > > kernel, unless there is another mechanism being used to reserve the > > memory in DT as well. > > > > Maybe we should wire up the EFI_PARAVIRT flag for Xen dom0 on arm64, > > so that we can filter out the call above. That would mean that > > Xen/arm64/dom0/kexec on affected hardware would result in a splat in > > the second kernel as well, but whether that matters depends on your > > support scope. > > In the context of Xen, dom0 doesn't have direct access to the host ITS > because we are emulating it. So I think it doesn't matter for us because we > can fix our implementation if it is affected. > > That said, kexec-ing dom0 (or any other domain) on Xen on Arm would require > some work to be supported. OOI, @leo is it something you are investigating? Now I am working on automative reference platform; the first thing for me is to resolve the kernel oops. For long term, I think the kexec/kdump would be important to be supported, to be clear, so far supporting kexec/kdump for Xen/Linux is not priority for my work. Also thanks a lot for Ard and Mark's replying. To be honest, I missed many prerequisites (e.g. redistributor configurations for GIC in hypervisor) and seems Xen uses a different way by emulating GICv3 controller for guest OS. So now I am bit puzzle what's for next step or just keep as it is? Thanks, Leo
On Thu, 18 Aug 2022 at 17:49, Leo Yan <leo.yan@linaro.org> wrote: > > On Thu, Aug 18, 2022 at 11:04:48AM +0100, Julien Grall wrote: > > [...] > > > > > Seems it's broken for kdump/kexec if kernel boots with using DT? > > > > > > > > > > EFI supports both DT and ACPI boot, but only ACPI *requires* EFI. > > > > > > So DT boot on hardware with affected GICv3 implementations works fine > > > with kdump/kexec as long as EFI is being used. Using non-EFI boot and > > > kexec on such systems will likely result in a splat on the second > > > kernel, unless there is another mechanism being used to reserve the > > > memory in DT as well. > > > > > > Maybe we should wire up the EFI_PARAVIRT flag for Xen dom0 on arm64, > > > so that we can filter out the call above. That would mean that > > > Xen/arm64/dom0/kexec on affected hardware would result in a splat in > > > the second kernel as well, but whether that matters depends on your > > > support scope. > > > > In the context of Xen, dom0 doesn't have direct access to the host ITS > > because we are emulating it. So I think it doesn't matter for us because we > > can fix our implementation if it is affected. > > > > That said, kexec-ing dom0 (or any other domain) on Xen on Arm would require > > some work to be supported. OOI, @leo is it something you are investigating? > > > Now I am working on automative reference platform; the first thing for > me is to resolve the kernel oops. > > For long term, I think the kexec/kdump would be important to be > supported, to be clear, so far supporting kexec/kdump for Xen/Linux is > not priority for my work. > > Also thanks a lot for Ard and Mark's replying. To be honest, I missed > many prerequisites (e.g. redistributor configurations for GIC in > hypervisor) and seems Xen uses a different way by emulating GICv3 > controller for guest OS. So now I am bit puzzle what's for next step > or just keep as it is? > If i understand Julien's remark correctly, the dom0 GICv3 is emulated, and so it should not suffer from the issue that we are working around here. Given that this workaround is still the sole user of the MEMRESERVE API, we would like to remain free to rip it out and replace it completely if we need to, and so implementing it in Xen is a bad idea, especially if the issue in question does not even exist in your case. This means that even if you do decide to support kexec, things should work fine in spite of this warning regarding the failure to perform the memory reservation, as the GIC can simply be reconfigured. So perhaps we should just [conditionally] tone down the warning?
On Thu, 18 Aug 2022 17:24:31 +0100, Ard Biesheuvel <ardb@kernel.org> wrote: > > On Thu, 18 Aug 2022 at 17:49, Leo Yan <leo.yan@linaro.org> wrote: > > > > On Thu, Aug 18, 2022 at 11:04:48AM +0100, Julien Grall wrote: > > > > [...] > > > > > > > Seems it's broken for kdump/kexec if kernel boots with using DT? > > > > > > > > > > > > > EFI supports both DT and ACPI boot, but only ACPI *requires* EFI. > > > > > > > > So DT boot on hardware with affected GICv3 implementations works fine > > > > with kdump/kexec as long as EFI is being used. Using non-EFI boot and > > > > kexec on such systems will likely result in a splat on the second > > > > kernel, unless there is another mechanism being used to reserve the > > > > memory in DT as well. > > > > > > > > Maybe we should wire up the EFI_PARAVIRT flag for Xen dom0 on arm64, > > > > so that we can filter out the call above. That would mean that > > > > Xen/arm64/dom0/kexec on affected hardware would result in a splat in > > > > the second kernel as well, but whether that matters depends on your > > > > support scope. > > > > > > In the context of Xen, dom0 doesn't have direct access to the host ITS > > > because we are emulating it. So I think it doesn't matter for us because we > > > can fix our implementation if it is affected. > > > > > > That said, kexec-ing dom0 (or any other domain) on Xen on Arm would require > > > some work to be supported. OOI, @leo is it something you are investigating? > > > > > > Now I am working on automative reference platform; the first thing for > > me is to resolve the kernel oops. > > > > For long term, I think the kexec/kdump would be important to be > > supported, to be clear, so far supporting kexec/kdump for Xen/Linux is > > not priority for my work. > > > > Also thanks a lot for Ard and Mark's replying. To be honest, I missed > > many prerequisites (e.g. redistributor configurations for GIC in > > hypervisor) and seems Xen uses a different way by emulating GICv3 > > controller for guest OS. So now I am bit puzzle what's for next step > > or just keep as it is? > > > > If i understand Julien's remark correctly, the dom0 GICv3 is emulated, > and so it should not suffer from the issue that we are working around > here. The problem is that there is no way to distinguish a system that suffers from GICR LPI tables being immutable from one that allows the LPI configuration being changed (either because the HW allows it or because the hypervisor plays other games). Once you're in the secondary kernel with the RDs enabled, you have already lost if there is a chance you could have reused this memory, and that's irrespective of being a guest or bare metal (interrupt delivery should still work during kexec). > Given that this workaround is still the sole user of the MEMRESERVE > API, we would like to remain free to rip it out and replace it > completely if we need to, and so implementing it in Xen is a bad idea, > especially if the issue in question does not even exist in your case. > This means that even if you do decide to support kexec, things should > work fine in spite of this warning regarding the failure to perform > the memory reservation, as the GIC can simply be reconfigured. > > So perhaps we should just [conditionally] tone down the warning? What we could do instead is to have some form of kexec hook that tries to go and turn the RDs off before jumping into the secondary kernel. If the hypervisor allows this (and honours the GICR_CTLR.EnableLPI bit), then there won't be any screaming in the secondary kernel. This is probably a new infrastructure though, as I don't think we have anything of the sort at the moment. Thanks, M.
On Fri, Aug 19, 2022 at 01:10:10PM +0100, Marc Zyngier wrote: [...] > > > > In the context of Xen, dom0 doesn't have direct access to the host ITS > > > > because we are emulating it. So I think it doesn't matter for us because we > > > > can fix our implementation if it is affected. > > > > > > > > That said, kexec-ing dom0 (or any other domain) on Xen on Arm would require > > > > some work to be supported. OOI, @leo is it something you are investigating? > > > > > > > > > Now I am working on automative reference platform; the first thing for > > > me is to resolve the kernel oops. > > > > > > For long term, I think the kexec/kdump would be important to be > > > supported, to be clear, so far supporting kexec/kdump for Xen/Linux is > > > not priority for my work. > > > > > > Also thanks a lot for Ard and Mark's replying. To be honest, I missed > > > many prerequisites (e.g. redistributor configurations for GIC in > > > hypervisor) and seems Xen uses a different way by emulating GICv3 > > > controller for guest OS. So now I am bit puzzle what's for next step > > > or just keep as it is? > > > > > > > If i understand Julien's remark correctly, the dom0 GICv3 is emulated, > > and so it should not suffer from the issue that we are working around > > here. Before proceeding discussion, I would like step back to get clear for the GIC implementation in Xen, otherwise, it's really hard for me to catch up the dicussion :) For me it's clear that Xen emulates GICv3 for DomU, but I am still confused how GICv3 works for Dom0. Xen directly passes ACPI MADT table from UEFI to Linux kernel to Dom0, (see functions acpi_create_madt() and gic_make_hwdom_madt()), which means the Linux kernel Dom0 uses the same ACPI table to initialize GICv3 driver, but since Linux kernel Dom0 accesses GIC memory region as IPA, it still trap to Xen in EL2 for stage 2 translation, so finally Xen can emulate the GICv3 device for Dom0. This is quite different from DomU. Xen prepares a DT node for GICv3 rather than directly passing ACPI table, so DomU kernel initializes GICv3 driver based on the DT binding. Simply to say, no matter Dom0 using ACPI table or DomU using DT binding, at the end Xen emulates GICv3 device for all of them. Another thing is not clear for me is that I can see Xen allocates redistributor pending page (see gicv3_lpi_set_pendtable()), after Dom0 or DomU kernel boots up, kernel allocates another RD pending page; so the question is how these two different pending pages co-work together. In other words, let's assume the Dom0 kernel panic and its secondary kernel is launched by kexec, is it necessarily for the secondary kernel to reuse the primary kernel's RD pending page? Or in this case it's no matter for the RD pending page in Dom0 and it's safe for Xen always maintains its own RD pending page in EL2? > The problem is that there is no way to distinguish a system that > suffers from GICR LPI tables being immutable from one that allows the > LPI configuration being changed (either because the HW allows it or > because the hypervisor plays other games). Let me ask a stupid question. Seems to me, GICR LPI tables can be configured as below options - The hypervisor pre-allocates GICR LPI tables and pass the memory region info to Dom0 kernel; - The hypervisor doesn't allocate GICR LPI tables, and then Dom0 kernel allocates GICR LPI tables for the virtual GICv3, and Dom0 directly write data to the GICR LPI tables and the table is used by physical GICv3; - The hypervisor pre-allocates GICR LPI tables for phycial GICv3 and Dom0 kernel allocates another GICR LPI tables for virtual GICv3, and Xen needs to sync between these two tables. To be clear, all of above three options are hypothesis. So please correct me if anything is wrong (or even total are wrong?!). Thanks a lot for suggestions! Leo P.s. sorry for truncating Marc's following comments, just want to focus on above questions.
Hi, On 25/08/2022 08:59, Leo Yan wrote: > On Fri, Aug 19, 2022 at 01:10:10PM +0100, Marc Zyngier wrote: > > [...] > >>>>> In the context of Xen, dom0 doesn't have direct access to the host ITS >>>>> because we are emulating it. So I think it doesn't matter for us because we >>>>> can fix our implementation if it is affected. >>>>> >>>>> That said, kexec-ing dom0 (or any other domain) on Xen on Arm would require >>>>> some work to be supported. OOI, @leo is it something you are investigating? >>>> >>>> >>>> Now I am working on automative reference platform; the first thing for >>>> me is to resolve the kernel oops. >>>> >>>> For long term, I think the kexec/kdump would be important to be >>>> supported, to be clear, so far supporting kexec/kdump for Xen/Linux is >>>> not priority for my work. >>>> >>>> Also thanks a lot for Ard and Mark's replying. To be honest, I missed >>>> many prerequisites (e.g. redistributor configurations for GIC in >>>> hypervisor) and seems Xen uses a different way by emulating GICv3 >>>> controller for guest OS. So now I am bit puzzle what's for next step >>>> or just keep as it is? >>>> >>> >>> If i understand Julien's remark correctly, the dom0 GICv3 is emulated, >>> and so it should not suffer from the issue that we are working around >>> here. > > Before proceeding discussion, I would like step back to get clear for > the GIC implementation in Xen, otherwise, it's really hard for me to > catch up the dicussion :) > > For me it's clear that Xen emulates GICv3 for DomU, but I am still > confused how GICv3 works for Dom0. > > Xen directly passes ACPI MADT table from UEFI to Linux kernel to Dom0, > (see functions acpi_create_madt() and gic_make_hwdom_madt()), which > means the Linux kernel Dom0 uses the same ACPI table to initialize GICv3 > driver, but since Linux kernel Dom0 accesses GIC memory region as IPA, > it still trap to Xen in EL2 for stage 2 translation, so finally Xen > can emulate the GICv3 device for Dom0. In the default setup, dom0 is also the hardware domain. So it owns all of the devices but the ones used by Xen (e.g. interrupt controller, SMMU). Therefore, dom0 will use the same memory layout as the host. At which point, it is a lot more convenient to re-use the host ACPI tables and rewrite only what's necessary. > > This is quite different from DomU. Xen prepares a DT node for GICv3 > rather than directly passing ACPI table, so DomU kernel initializes > GICv3 driver based on the DT binding. DomUs memory layout is defined by Xen. So we need to create the Device-Tree and ACPI tables (both are supported) from scrartch. > > Simply to say, no matter Dom0 using ACPI table or DomU using DT > binding, at the end Xen emulates GICv3 device for all of them. Correct. In both situations the GICv3 will be owned by Xen and we will emulate the bits that are not virtualized by the CPUs (e.g. re-distributors). > > Another thing is not clear for me is that I can see Xen allocates > redistributor pending page (see gicv3_lpi_set_pendtable()), after Dom0 > or DomU kernel boots up, kernel allocates another RD pending page; so > the question is how these two different pending pages co-work > together. Xen allocates the pending pages that will be used by the host. Dom0/DomU will be allocating pending pages that will be used by the virtual GICv3. > > In other words, let's assume the Dom0 kernel panic and its secondary > kernel is launched by kexec, is it necessarily for the secondary > kernel to reuse the primary kernel's RD pending page? No. > Or in this case > it's no matter for the RD pending page in Dom0 and it's safe for Xen > always maintains its own RD pending page in EL2? Dom0 doesn't have direct access to the host GICv3 (this will be controlled by Xen). What we expose to dom0 is a virtual GICv3. So effectively we have two different GICv3 and each of them will require their own set of pending table. > >> The problem is that there is no way to distinguish a system that >> suffers from GICR LPI tables being immutable from one that allows the >> LPI configuration being changed (either because the HW allows it or >> because the hypervisor plays other games). > > Let me ask a stupid question. Seems to me, GICR LPI tables can be > configured as below options > > - The hypervisor pre-allocates GICR LPI tables and pass the memory > region info to Dom0 kernel; > > - The hypervisor doesn't allocate GICR LPI tables, and then Dom0 > kernel allocates GICR LPI tables for the virtual GICv3, and Dom0 > directly write data to the GICR LPI tables and the table is used by > physical GICv3; > > - The hypervisor pre-allocates GICR LPI tables for phycial GICv3 and > Dom0 kernel allocates another GICR LPI tables for virtual GICv3, > and Xen needs to sync between these two tables. > > To be clear, all of above three options are hypothesis. So please > correct me if anything is wrong (or even total are wrong?!). I will defer this question to Marc. Cheers,
Hi Julien, On Thu, Aug 25, 2022 at 10:07:18AM +0100, Julien Grall wrote: [...] > > In other words, let's assume the Dom0 kernel panic and its secondary > > kernel is launched by kexec, is it necessarily for the secondary > > kernel to reuse the primary kernel's RD pending page? > > No. If the answer is no, then I think it's feasible to pass the same ACPI table or DT binding for virtual GICv3 from primary kernel to secondary kernel, then the second kernel can initialize the VGIC and allocate a new RD tables (and trap to Xen in EL2 to handle the new allocated RD tables). How about you think for this? Thanks a lot for quick response. Leo
On Thu, Aug 25, 2022 at 10:07:18AM +0100, Julien Grall wrote: [...] > > Xen directly passes ACPI MADT table from UEFI to Linux kernel to Dom0, > > (see functions acpi_create_madt() and gic_make_hwdom_madt()), which > > means the Linux kernel Dom0 uses the same ACPI table to initialize GICv3 > > driver, but since Linux kernel Dom0 accesses GIC memory region as IPA, > > it still trap to Xen in EL2 for stage 2 translation, so finally Xen > > can emulate the GICv3 device for Dom0. > > In the default setup, dom0 is also the hardware domain. So it owns all of > the devices but the ones used by Xen (e.g. interrupt controller, SMMU). > > Therefore, dom0 will use the same memory layout as the host. At which point, > it is a lot more convenient to re-use the host ACPI tables and rewrite only > what's necessary. We cannot purely talk about interrupt handling without connecting with device driver model. Seems to me, to support para virtualization driver model (like virtio), Dom0 needs to provide the device driver backend, and DomUs enables the forend device drivers. In this case, the most hardware interrupts (SPIs) are routed to Dom0. To support passthrough driver model (VFIO), Xen needs to configure the hardware GICv3 to directly route hardware interrupt to the virtual CPU interface. But here I still cannot create the concept that how GIC RD tables play roles to support the para virtualization or passthrough mode. Thanks, Leo
Hi Leo, On 25/08/2022 12:24, Leo Yan wrote: > On Thu, Aug 25, 2022 at 10:07:18AM +0100, Julien Grall wrote: > > [...] > >>> In other words, let's assume the Dom0 kernel panic and its secondary >>> kernel is launched by kexec, is it necessarily for the secondary >>> kernel to reuse the primary kernel's RD pending page? >> >> No. > > If the answer is no, then I think it's feasible to pass the same ACPI > table or DT binding for virtual GICv3 from primary kernel to secondary > kernel, then the second kernel can initialize the VGIC and allocate a > new RD tables (and trap to Xen in EL2 to handle the new allocated RD > tables). How about you think for this? I think that would work. Cheers,
Hi Leo, On 25/08/2022 12:50, Leo Yan wrote: > On Thu, Aug 25, 2022 at 10:07:18AM +0100, Julien Grall wrote: > > [...] > >>> Xen directly passes ACPI MADT table from UEFI to Linux kernel to Dom0, >>> (see functions acpi_create_madt() and gic_make_hwdom_madt()), which >>> means the Linux kernel Dom0 uses the same ACPI table to initialize GICv3 >>> driver, but since Linux kernel Dom0 accesses GIC memory region as IPA, >>> it still trap to Xen in EL2 for stage 2 translation, so finally Xen >>> can emulate the GICv3 device for Dom0. >> >> In the default setup, dom0 is also the hardware domain. So it owns all of >> the devices but the ones used by Xen (e.g. interrupt controller, SMMU). >> >> Therefore, dom0 will use the same memory layout as the host. At which point, >> it is a lot more convenient to re-use the host ACPI tables and rewrite only >> what's necessary. > > We cannot purely talk about interrupt handling without connecting with > device driver model. > > Seems to me, to support para virtualization driver model (like virtio), > Dom0 needs to provide the device driver backend, and DomUs enables > the forend device drivers. In this case, the most hardware interrupts > (SPIs) are routed to Dom0. That's correct. Most of the shared interrupts will be routed to dom0. > To support passthrough driver model (VFIO), Xen needs to configure the > hardware GICv3 to directly route hardware interrupt to the virtual CPU > interface. Do you mean GICv4 rather than GICv3? In the latter, all the interrupts will be received in Xen and then routed to the domain by updating the LRs. > > But here I still cannot create the concept that how GIC RD tables play > roles to support the para virtualization or passthrough mode. I am not sure what you are actually asking. The pending tables are just memory you give to the GICv3 to record the state of the interrupts. Cheers,
Hi Julien, On Thu, Aug 25, 2022 at 01:59:06PM +0100, Julien Grall wrote: [...] > > Seems to me, to support para virtualization driver model (like virtio), > > Dom0 needs to provide the device driver backend, and DomUs enables > > the forend device drivers. In this case, the most hardware interrupts > > (SPIs) are routed to Dom0. > > That's correct. Most of the shared interrupts will be routed to dom0. Thanks for confirmation. > > To support passthrough driver model (VFIO), Xen needs to configure the > > hardware GICv3 to directly route hardware interrupt to the virtual CPU > > interface. > > Do you mean GICv4 rather than GICv3? In the latter, all the interrupts will > be received in Xen and then routed to the domain by updating the LRs. Thanks for clarification. So GICv3 relies on hypervisor to set LR, and VM can use virtural interface to response (ACK/EOI) the interrupt. GICv4 can directly forward the SPI to the CPU virtual interface (without hypervisor's interfering). > > But here I still cannot create the concept that how GIC RD tables play > > roles to support the para virtualization or passthrough mode. > > I am not sure what you are actually asking. The pending tables are just > memory you give to the GICv3 to record the state of the interrupts. For more specific, Xen has its own RD pending table, and we can use this pending table to set state for SGI/PPI/LPI for a specific CPU interface. Xen works as hypervisor, it saves and restores the pending table according to switched in VM context, right? On the other hand, what's the purpose for Linux kernel's GIC RD pending table? Is it only used for nested virtulisation? I mean if Linux kernel's GIC RD pending table is not used for the drivers in Dom0 or DomU, then it's useless to pass it from the primary kernel to secondary kernel; as result, we don't need to reserve the persistent memory for the pending table in this case. Thanks, Leo
On Thu, Aug 25, 2022 at 10:40:41PM +0800, Leo Yan wrote: [...] > > > But here I still cannot create the concept that how GIC RD tables play > > > roles to support the para virtualization or passthrough mode. > > > > I am not sure what you are actually asking. The pending tables are just > > memory you give to the GICv3 to record the state of the interrupts. > > For more specific, Xen has its own RD pending table, and we can use > this pending table to set state for SGI/PPI/LPI for a specific CPU > interface. Xen works as hypervisor, it saves and restores the pending > table according to switched in VM context, right? > > On the other hand, what's the purpose for Linux kernel's GIC RD > pending table? Is it only used for nested virtulisation? I mean if > Linux kernel's GIC RD pending table is not used for the drivers in > Dom0 or DomU, then it's useless to pass it from the primary kernel to > secondary kernel; as result, we don't need to reserve the persistent > memory for the pending table in this case. I don't receive further confirmation from Marc, anyway, I tried to cook a kernel patch to mute the kernel oops [1]. Hope this is not too arbitrary and we can move forward a bit. Thanks, Leo [1] https://lore.kernel.org/lkml/20220906024040.503764-1-leo.yan@linaro.org/T/#u
On Tue, 06 Sep 2022 03:52:37 +0100, Leo Yan <leo.yan@linaro.org> wrote: > > On Thu, Aug 25, 2022 at 10:40:41PM +0800, Leo Yan wrote: > > [...] > > > > > But here I still cannot create the concept that how GIC RD tables play > > > > roles to support the para virtualization or passthrough mode. > > > > > > I am not sure what you are actually asking. The pending tables are just > > > memory you give to the GICv3 to record the state of the interrupts. > > > > For more specific, Xen has its own RD pending table, and we can use > > this pending table to set state for SGI/PPI/LPI for a specific CPU > > interface. Xen works as hypervisor, it saves and restores the pending > > table according to switched in VM context, right? > > > > On the other hand, what's the purpose for Linux kernel's GIC RD > > pending table? Is it only used for nested virtulisation? I mean if > > Linux kernel's GIC RD pending table is not used for the drivers in > > Dom0 or DomU, then it's useless to pass it from the primary kernel to > > secondary kernel; as result, we don't need to reserve the persistent > > memory for the pending table in this case. > > I don't receive further confirmation from Marc, anyway, I tried to cook > a kernel patch to mute the kernel oops [1]. What sort of confirmation do you expect from me? None of what you write above make much sense in the face of the architecture. > Hope this is not too arbitrary and we can move forward a bit. > > Thanks, > Leo > > [1] https://lore.kernel.org/lkml/20220906024040.503764-1-leo.yan@linaro.org/T/#u I'm totally baffled by the fact you're trying to add some extra hacks to Linux just to paper over some of the Xen's own issues. M.
Hi Marc, On Tue, Sep 06, 2022 at 07:27:17AM +0100, Marc Zyngier wrote: > On Tue, 06 Sep 2022 03:52:37 +0100, > Leo Yan <leo.yan@linaro.org> wrote: > > > > On Thu, Aug 25, 2022 at 10:40:41PM +0800, Leo Yan wrote: > > > > [...] > > > > > > > But here I still cannot create the concept that how GIC RD tables play > > > > > roles to support the para virtualization or passthrough mode. > > > > > > > > I am not sure what you are actually asking. The pending tables are just > > > > memory you give to the GICv3 to record the state of the interrupts. > > > > > > For more specific, Xen has its own RD pending table, and we can use > > > this pending table to set state for SGI/PPI/LPI for a specific CPU > > > interface. Xen works as hypervisor, it saves and restores the pending > > > table according to switched in VM context, right? > > > > > > On the other hand, what's the purpose for Linux kernel's GIC RD > > > pending table? Is it only used for nested virtulisation? I mean if > > > Linux kernel's GIC RD pending table is not used for the drivers in > > > Dom0 or DomU, then it's useless to pass it from the primary kernel to > > > secondary kernel; as result, we don't need to reserve the persistent > > > memory for the pending table in this case. > > > > I don't receive further confirmation from Marc, anyway, I tried to cook > > a kernel patch to mute the kernel oops [1]. > > What sort of confirmation do you expect from me? None of what you > write above make much sense in the face of the architecture. Okay, I think have two questions for you: - The first question is if we really need to reserve persistent memory for RD pending table and configuration table when Linux kernel runs in Xen domain? - If the first question's answer is no, so it's not necessary to reserve RD pending table and configuration table for Xen, then what's the good way to dismiss the kernel oops? IIUC, you consider the general flow from architecture view, so you prefer to ask Xen to implement EFI stub to comply the general flow for EFI booting sequence, right? If the conclusion is to change Xen for support EFI stub, then this would be fine for me and I will hold on and leave Xen developers to work on it. > > [1] https://lore.kernel.org/lkml/20220906024040.503764-1-leo.yan@linaro.org/T/#u > > I'm totally baffled by the fact you're trying to add some extra hacks > to Linux just to paper over some of the Xen's own issues. I have a last question for why kernel reserves RD pending table and configuration table for kexec. As we know, the primary kernel and the secondary kernel use separate memory regions, this means there have no race condition that secondary kernel modifies the tables whilist the GIC accesses the table if the secondary kernel allocates new pages for RD tables. So only one potential issue I can image is the secondary kernel sets new RD pending table and configuration table, which might introduce inconsistent issue with rest RDs in the system. Could you confirm if my understanding is correct or not? Sorry for noise and many questions. I understand this is a complex and difficult topic for me, and it's very likely that I am absent sufficient knowledge for this part, this is just what I want to learn from the discussion and from you :-) Thanks, Leo
On Tue, 6 Sept 2022 at 09:17, Leo Yan <leo.yan@linaro.org> wrote: > > Hi Marc, > > On Tue, Sep 06, 2022 at 07:27:17AM +0100, Marc Zyngier wrote: > > On Tue, 06 Sep 2022 03:52:37 +0100, > > Leo Yan <leo.yan@linaro.org> wrote: > > > > > > On Thu, Aug 25, 2022 at 10:40:41PM +0800, Leo Yan wrote: > > > > > > [...] > > > > > > > > > But here I still cannot create the concept that how GIC RD tables play > > > > > > roles to support the para virtualization or passthrough mode. > > > > > > > > > > I am not sure what you are actually asking. The pending tables are just > > > > > memory you give to the GICv3 to record the state of the interrupts. > > > > > > > > For more specific, Xen has its own RD pending table, and we can use > > > > this pending table to set state for SGI/PPI/LPI for a specific CPU > > > > interface. Xen works as hypervisor, it saves and restores the pending > > > > table according to switched in VM context, right? > > > > > > > > On the other hand, what's the purpose for Linux kernel's GIC RD > > > > pending table? Is it only used for nested virtulisation? I mean if > > > > Linux kernel's GIC RD pending table is not used for the drivers in > > > > Dom0 or DomU, then it's useless to pass it from the primary kernel to > > > > secondary kernel; as result, we don't need to reserve the persistent > > > > memory for the pending table in this case. > > > > > > I don't receive further confirmation from Marc, anyway, I tried to cook > > > a kernel patch to mute the kernel oops [1]. > > > > What sort of confirmation do you expect from me? None of what you > > write above make much sense in the face of the architecture. > > Okay, I think have two questions for you: > > - The first question is if we really need to reserve persistent memory > for RD pending table and configuration table when Linux kernel runs > in Xen domain? > > - If the first question's answer is no, so it's not necessary to reserve > RD pending table and configuration table for Xen, then what's the good > way to dismiss the kernel oops? > > IIUC, you consider the general flow from architecture view, so you prefer > to ask Xen to implement EFI stub to comply the general flow for EFI > booting sequence, right? > > If the conclusion is to change Xen for support EFI stub, then this > would be fine for me and I will hold on and leave Xen developers to work > on it. > As I mentioned before, proper EFI boot support in Xen would be nice. *However*, I don't think it makes sense to go through all the trouble of implementing that just to shut up a warning that doesn't affect Xen to begin with. > > > [1] https://lore.kernel.org/lkml/20220906024040.503764-1-leo.yan@linaro.org/T/#u > > > > I'm totally baffled by the fact you're trying to add some extra hacks > > to Linux just to paper over some of the Xen's own issues. > > I have a last question for why kernel reserves RD pending table and > configuration table for kexec. As we know, the primary kernel and > the secondary kernel use separate memory regions, This is only true for kdump, not for kexec in general. > this means there have > no race condition that secondary kernel modifies the tables whilist the > GIC accesses the table if the secondary kernel allocates new pages for > RD tables. So only one potential issue I can image is the secondary > kernel sets new RD pending table and configuration table, which might > introduce inconsistent issue with rest RDs in the system. > > Could you confirm if my understanding is correct or not? > > Sorry for noise and many questions. I understand this is a complex > and difficult topic for me, and it's very likely that I am absent > sufficient knowledge for this part, this is just what I want to > learn from the discussion and from you :-) > > Thanks, > Leo
On Tue, Sep 06, 2022 at 09:22:00AM +0200, Ard Biesheuvel wrote: [...] > > IIUC, you consider the general flow from architecture view, so you prefer > > to ask Xen to implement EFI stub to comply the general flow for EFI > > booting sequence, right? > > > > If the conclusion is to change Xen for support EFI stub, then this > > would be fine for me and I will hold on and leave Xen developers to work > > on it. > > > > As I mentioned before, proper EFI boot support in Xen would be nice. > *However*, I don't think it makes sense to go through all the trouble > of implementing that just to shut up a warning that doesn't affect Xen > to begin with. Another option is we can set a bit for xen feature, so Linux kernel can read out the xen feature and make decision if need to reserve memory for RD tables based on the new feature bit. This is somehow a solution is to create a general protocol between Xen and Linux kernel. How about you think for this? Thanks, Leo
On Tue, Sep 06, 2022 at 03:27:47PM +0800, Leo Yan wrote: > On Tue, Sep 06, 2022 at 09:22:00AM +0200, Ard Biesheuvel wrote: > > [...] > > > > IIUC, you consider the general flow from architecture view, so you prefer > > > to ask Xen to implement EFI stub to comply the general flow for EFI > > > booting sequence, right? > > > > > > If the conclusion is to change Xen for support EFI stub, then this > > > would be fine for me and I will hold on and leave Xen developers to work > > > on it. > > > > > > > As I mentioned before, proper EFI boot support in Xen would be nice. > > *However*, I don't think it makes sense to go through all the trouble > > of implementing that just to shut up a warning that doesn't affect Xen > > to begin with. > > Another option is we can set a bit for xen feature, so Linux kernel > can read out the xen feature and make decision if need to reserve > memory for RD tables based on the new feature bit. This is somehow > a solution is to create a general protocol between Xen and Linux kernel. > > How about you think for this? Just supplement info. I tried to set flag EFI_PARAVIRT in Linux kernel, but kernel cannot boot up successfully on Arm64. Seems the Linux kernel will not map memory correctly after settting this flag. This is why I didn't move forward with this flag. Thanks, Leo
On Tue, 06 Sep 2022 08:17:14 +0100, Leo Yan <leo.yan@linaro.org> wrote: > > Hi Marc, > > On Tue, Sep 06, 2022 at 07:27:17AM +0100, Marc Zyngier wrote: > > On Tue, 06 Sep 2022 03:52:37 +0100, > > Leo Yan <leo.yan@linaro.org> wrote: > > > > > > On Thu, Aug 25, 2022 at 10:40:41PM +0800, Leo Yan wrote: > > > > > > [...] > > > > > > > > > But here I still cannot create the concept that how GIC RD tables play > > > > > > roles to support the para virtualization or passthrough mode. > > > > > > > > > > I am not sure what you are actually asking. The pending tables are just > > > > > memory you give to the GICv3 to record the state of the interrupts. > > > > > > > > For more specific, Xen has its own RD pending table, and we can use > > > > this pending table to set state for SGI/PPI/LPI for a specific CPU > > > > interface. Xen works as hypervisor, it saves and restores the pending > > > > table according to switched in VM context, right? > > > > > > > > On the other hand, what's the purpose for Linux kernel's GIC RD > > > > pending table? Is it only used for nested virtulisation? I mean if > > > > Linux kernel's GIC RD pending table is not used for the drivers in > > > > Dom0 or DomU, then it's useless to pass it from the primary kernel to > > > > secondary kernel; as result, we don't need to reserve the persistent > > > > memory for the pending table in this case. > > > > > > I don't receive further confirmation from Marc, anyway, I tried to cook > > > a kernel patch to mute the kernel oops [1]. > > > > What sort of confirmation do you expect from me? None of what you > > write above make much sense in the face of the architecture. > > Okay, I think have two questions for you: > > - The first question is if we really need to reserve persistent memory > for RD pending table and configuration table when Linux kernel runs > in Xen domain? I have no idea, and really I don't want to know. The architecture doesn't make it safe to reuse that memory, and the driver does the right thing by always reserving that memory when the FW is supposed to support it. The "oh but it is safe on so and so" approach doesn't scale. If you want to have such a thing, just convince people at ARM that it is possible to implement a GICv3-compliant system without the RD tables, get them to update the architecture to allow this scheme and advertise it in a discoverable register. Xen could then implement it, Linux could check this bit, and we'd all be a happy family. Because that's really what this is: it isn't that you don't care about the RD tables being reserved. It is that you don't care about them at all because they are never used by Xen as the GIC implementation. Your approach of "huh, let's not reserve it" just papers over this. > > - If the first question's answer is no, so it's not necessary to reserve > RD pending table and configuration table for Xen, then what's the good > way to dismiss the kernel oops? A warning, not an oops. > > IIUC, you consider the general flow from architecture view, so you prefer > to ask Xen to implement EFI stub to comply the general flow for EFI > booting sequence, right? If you want to use ACPI, you use EFI, and not a vague emulation of it. If you use DT, you can reserve the memory upfront. The various alternatives are in this thread. > > If the conclusion is to change Xen for support EFI stub, then this > would be fine for me and I will hold on and leave Xen developers to work > on it. > > > > [1] https://lore.kernel.org/lkml/20220906024040.503764-1-leo.yan@linaro.org/T/#u > > > > I'm totally baffled by the fact you're trying to add some extra hacks > > to Linux just to paper over some of the Xen's own issues. > > I have a last question for why kernel reserves RD pending table and > configuration table for kexec. As we know, the primary kernel and > the secondary kernel use separate memory regions, No, you got it wrong. Only with *kdump* do you get separate memory regions. kexec reuses all of the memory visible by the primary kernel. > this means there have > no race condition that secondary kernel modifies the tables whilist the > GIC accesses the table if the secondary kernel allocates new pages for > RD tables. So only one potential issue I can image is the secondary > kernel sets new RD pending table and configuration table, which might > introduce inconsistent issue with rest RDs in the system. > > Could you confirm if my understanding is correct or not? It isn't correct. - There is no race condition. Once the RD tables are configured, they cannot be changed. - When the kdump kernel boots, none of the primary OS memory is reused, so it is safe to continue and use the same tables in place - When the kexec kernel boots, all of the memory except for the reserved memory is reused. If your RD tables are used for anything, you'll see memory corruption as the GIC writes pending bits in the pending table, and you'll be unable to configure interrupts correctly. In conclusion, using kexec with GICv3 is completely unsafe if you don't reserve the memory allocated to the RDs. > Sorry for noise and many questions. I understand this is a complex > and difficult topic for me, and it's very likely that I am absent > sufficient knowledge for this part, this is just what I want to > learn from the discussion and from you :-) I suggest you read the architecture spec, which has all the details. M.
On Tue, 06 Sep 2022 08:27:47 +0100, Leo Yan <leo.yan@linaro.org> wrote: > > On Tue, Sep 06, 2022 at 09:22:00AM +0200, Ard Biesheuvel wrote: > > [...] > > > > IIUC, you consider the general flow from architecture view, so you prefer > > > to ask Xen to implement EFI stub to comply the general flow for EFI > > > booting sequence, right? > > > > > > If the conclusion is to change Xen for support EFI stub, then this > > > would be fine for me and I will hold on and leave Xen developers to work > > > on it. > > > > > > > As I mentioned before, proper EFI boot support in Xen would be nice. > > *However*, I don't think it makes sense to go through all the trouble > > of implementing that just to shut up a warning that doesn't affect Xen > > to begin with. > > Another option is we can set a bit for xen feature, so Linux kernel > can read out the xen feature and make decision if need to reserve > memory for RD tables based on the new feature bit. This is somehow > a solution is to create a general protocol between Xen and Linux kernel. > > How about you think for this? No. If there is such a bit, it has to be in the GIC architecture. I'm not putting anything hypervisor-specific into the GIC driver. Others have tried before you, and they ended up fixing their hypervisor instead. Feel free to talk to ARM to get the architecture updated instead. M.
On Tue, Sep 06, 2022 at 08:53:02AM +0100, Marc Zyngier wrote: [...] > > Okay, I think have two questions for you: > > > > - The first question is if we really need to reserve persistent memory > > for RD pending table and configuration table when Linux kernel runs > > in Xen domain? > > I have no idea, and really I don't want to know. The architecture > doesn't make it safe to reuse that memory, and the driver does the > right thing by always reserving that memory when the FW is supposed to > support it. > > The "oh but it is safe on so and so" approach doesn't scale. If you > want to have such a thing, just convince people at ARM that it is > possible to implement a GICv3-compliant system without the RD tables, > get them to update the architecture to allow this scheme and advertise > it in a discoverable register. Xen could then implement it, Linux > could check this bit, and we'd all be a happy family. I agree that my patch is not based on a scale approach, this is also my concern. To be honest, convincing Arm GIC team is a bit out of my working scope. I am working on automative project, when I saw verbose log with bunch of kernel warnings with Xen, it motivated me to chase down, this is the main reason I tried to explore some solutions at here. > Because that's really what this is: it isn't that you don't care about > the RD tables being reserved. It is that you don't care about them at > all because they are never used by Xen as the GIC implementation. Your > approach of "huh, let's not reserve it" just papers over this. > > - If the first question's answer is no, so it's not necessary to reserve > > RD pending table and configuration table for Xen, then what's the good > > way to dismiss the kernel oops? > > A warning, not an oops. > > > > > IIUC, you consider the general flow from architecture view, so you prefer > > to ask Xen to implement EFI stub to comply the general flow for EFI > > booting sequence, right? > > If you want to use ACPI, you use EFI, and not a vague emulation of > it. If you use DT, you can reserve the memory upfront. The various > alternatives are in this thread. Given I proposed two fixes, one is for Xen and another is for kernel, but both are rejected, so I would like to hold on a bit for this. Maybe it's a good point for Xen maintainers to review if it needs to support EFI stub, or welcome any other suggestions. > > If the conclusion is to change Xen for support EFI stub, then this > > would be fine for me and I will hold on and leave Xen developers to work > > on it. > > > > > > [1] https://lore.kernel.org/lkml/20220906024040.503764-1-leo.yan@linaro.org/T/#u > > > > > > I'm totally baffled by the fact you're trying to add some extra hacks > > > to Linux just to paper over some of the Xen's own issues. > > > > I have a last question for why kernel reserves RD pending table and > > configuration table for kexec. As we know, the primary kernel and > > the secondary kernel use separate memory regions, > > No, you got it wrong. Only with *kdump* do you get separate memory > regions. kexec reuses all of the memory visible by the primary kernel. Thanks for correction. > > this means there have > > no race condition that secondary kernel modifies the tables whilist the > > GIC accesses the table if the secondary kernel allocates new pages for > > RD tables. So only one potential issue I can image is the secondary > > kernel sets new RD pending table and configuration table, which might > > introduce inconsistent issue with rest RDs in the system. > > > > Could you confirm if my understanding is correct or not? > > It isn't correct. > > - There is no race condition. Once the RD tables are configured, they > cannot be changed. > > - When the kdump kernel boots, none of the primary OS memory is > reused, so it is safe to continue and use the same tables in place > > - When the kexec kernel boots, all of the memory except for the > reserved memory is reused. If your RD tables are used for anything, > you'll see memory corruption as the GIC writes pending bits in the > pending table, and you'll be unable to configure interrupts > correctly. This gives me impression that when do a power cycle, CPUs are reset but GIC is still alive, so for every booting time the same memory region should be reserved for RD tables. To be honest, it's a bit weird for me that if a system cannot reset CPUs and GIC together. > In conclusion, using kexec with GICv3 is completely unsafe if you > don't reserve the memory allocated to the RDs. > > > Sorry for noise and many questions. I understand this is a complex > > and difficult topic for me, and it's very likely that I am absent > > sufficient knowledge for this part, this is just what I want to > > learn from the discussion and from you :-) > > I suggest you read the architecture spec, which has all the details. Thank you for many suggestions! I read a bit for GICv3/v4 arch spec at last week, but I need to do more homework. Thanks, Leo
On Tue, 06 Sep 2022 16:13:25 +0100, Leo Yan <leo.yan@linaro.org> wrote: > > On Tue, Sep 06, 2022 at 08:53:02AM +0100, Marc Zyngier wrote: > > [...] > > > > Okay, I think have two questions for you: > > > > > > - The first question is if we really need to reserve persistent memory > > > for RD pending table and configuration table when Linux kernel runs > > > in Xen domain? > > > > I have no idea, and really I don't want to know. The architecture > > doesn't make it safe to reuse that memory, and the driver does the > > right thing by always reserving that memory when the FW is supposed to > > support it. > > > > The "oh but it is safe on so and so" approach doesn't scale. If you > > want to have such a thing, just convince people at ARM that it is > > possible to implement a GICv3-compliant system without the RD tables, > > get them to update the architecture to allow this scheme and advertise > > it in a discoverable register. Xen could then implement it, Linux > > could check this bit, and we'd all be a happy family. > > I agree that my patch is not based on a scale approach, this is also > my concern. > > To be honest, convincing Arm GIC team is a bit out of my working scope. > I am working on automative project, when I saw verbose log with bunch of > kernel warnings with Xen, it motivated me to chase down, this is the > main reason I tried to explore some solutions at here. And it is fine to explore things. But I don't think there is a cheap option here. [...] > > - When the kexec kernel boots, all of the memory except for the > > reserved memory is reused. If your RD tables are used for anything, > > you'll see memory corruption as the GIC writes pending bits in the > > pending table, and you'll be unable to configure interrupts > > correctly. > > This gives me impression that when do a power cycle, CPUs are reset > but GIC is still alive, so for every booting time the same memory > region should be reserved for RD tables. > > To be honest, it's a bit weird for me that if a system cannot reset > CPUs and GIC together. There is no reset involved across kexec. That's the whole point. Nothing is reset, nothing is powered off. This is why kexec is so fast, and this is why it is so fragile. Thanks, M.
diff --git a/xen/arch/arm/acpi/domain_build.c b/xen/arch/arm/acpi/domain_build.c index bbdc90f92c..f7d1935f60 100644 --- a/xen/arch/arm/acpi/domain_build.c +++ b/xen/arch/arm/acpi/domain_build.c @@ -242,6 +242,26 @@ static int __init acpi_create_rsdp(struct domain *d, struct membank tbl_add[]) return 0; } +static int __init acpi_create_memreserve(struct domain *d, struct membank tbl_add[]) +{ + struct acpi_table_memreserve *memreserve = NULL; + u64 table_size = sizeof(struct acpi_table_memreserve); + u8 *base_ptr; + + base_ptr = d->arch.efi_acpi_table + + acpi_get_table_offset(tbl_add, TBL_MRSV); + memreserve = (struct acpi_table_memreserve *)base_ptr; + + memreserve->next = 0; + memreserve->size = 0; + memreserve->count = 0; + + tbl_add[TBL_MRSV].start = d->arch.efi_acpi_gpa + + acpi_get_table_offset(tbl_add, TBL_MRSV); + tbl_add[TBL_MRSV].size = table_size; + return 0; +} + static void __init acpi_xsdt_modify_entry(u64 entry[], unsigned long entry_count, char *signature, u64 addr) @@ -543,6 +563,10 @@ int __init prepare_acpi(struct domain *d, struct kernel_info *kinfo) if ( rc != 0 ) return rc; + rc = acpi_create_memreserve(d, tbl_add); + if ( rc != 0 ) + return rc; + acpi_map_other_tables(d); acpi_create_efi_system_table(d, tbl_add); acpi_create_efi_mmap_table(d, &kinfo->mem, tbl_add); diff --git a/xen/arch/arm/efi/efi-dom0.c b/xen/arch/arm/efi/efi-dom0.c index aae0f97911..4950f9ac99 100644 --- a/xen/arch/arm/efi/efi-dom0.c +++ b/xen/arch/arm/efi/efi-dom0.c @@ -38,7 +38,7 @@ size_t __init estimate_efi_size(unsigned int mem_nr_banks) { size_t size; size_t est_size = sizeof(EFI_SYSTEM_TABLE); - size_t ect_size = sizeof(EFI_CONFIGURATION_TABLE); + size_t ect_size = sizeof(EFI_CONFIGURATION_TABLE) * 2; size_t emd_size = sizeof(EFI_MEMORY_DESCRIPTOR); size_t fw_vendor_size = sizeof(xen_efi_fw_vendor); unsigned int acpi_mem_nr_banks = 0; @@ -63,7 +63,8 @@ void __init acpi_create_efi_system_table(struct domain *d, table_addr = d->arch.efi_acpi_gpa + acpi_get_table_offset(tbl_add, TBL_EFIT); - table_size = sizeof(EFI_SYSTEM_TABLE) + sizeof(EFI_CONFIGURATION_TABLE) + table_size = sizeof(EFI_SYSTEM_TABLE) + + sizeof(EFI_CONFIGURATION_TABLE) * 2 + sizeof(xen_efi_fw_vendor); base_ptr = d->arch.efi_acpi_table + acpi_get_table_offset(tbl_add, TBL_EFIT); @@ -75,7 +76,7 @@ void __init acpi_create_efi_system_table(struct domain *d, efi_sys_tbl->Hdr.HeaderSize = table_size; efi_sys_tbl->FirmwareRevision = 1; - efi_sys_tbl->NumberOfTableEntries = 1; + efi_sys_tbl->NumberOfTableEntries = 2; offset += sizeof(EFI_SYSTEM_TABLE); memcpy(base_ptr + offset, xen_efi_fw_vendor, sizeof(xen_efi_fw_vendor)); efi_sys_tbl->FirmwareVendor = (CHAR16 *)(table_addr + offset); @@ -86,6 +87,18 @@ void __init acpi_create_efi_system_table(struct domain *d, efi_conf_tbl->VendorTable = (VOID *)tbl_add[TBL_RSDP].start; efi_sys_tbl->ConfigurationTable = (EFI_CONFIGURATION_TABLE *)(table_addr + offset); + + /* + * Configuration table for MEMRESERVE is used in Linux kernel for + * reserving pages, its main purpose is used for kexec/kdump to + * reserve persistent pages (e.g. GIC pending pages) for the secondary + * kernel. + */ + offset += sizeof(EFI_CONFIGURATION_TABLE); + efi_conf_tbl = (EFI_CONFIGURATION_TABLE *)(base_ptr + offset); + efi_conf_tbl->VendorGuid = (EFI_GUID)LINUX_EFI_MEMRESERVE_TABLE_GUID; + efi_conf_tbl->VendorTable = (VOID *)tbl_add[TBL_MRSV].start; + xz_crc32_init(); efi_sys_tbl->Hdr.CRC32 = xz_crc32((uint8_t *)efi_sys_tbl, efi_sys_tbl->Hdr.HeaderSize, 0); diff --git a/xen/arch/arm/include/asm/acpi.h b/xen/arch/arm/include/asm/acpi.h index e53973e054..a8c523e387 100644 --- a/xen/arch/arm/include/asm/acpi.h +++ b/xen/arch/arm/include/asm/acpi.h @@ -35,6 +35,7 @@ typedef enum { TBL_STAO, TBL_XSDT, TBL_RSDP, + TBL_MRSV, TBL_EFIT, TBL_MMAP, TBL_MMAX, diff --git a/xen/include/acpi/actbl.h b/xen/include/acpi/actbl.h index 3079176992..682b27f353 100644 --- a/xen/include/acpi/actbl.h +++ b/xen/include/acpi/actbl.h @@ -302,6 +302,23 @@ struct acpi_table_fadt { #define ACPI_FADT_HW_REDUCED (1<<20) /* 20: [V5] ACPI hardware is not implemented (ACPI 5.0) */ #define ACPI_FADT_LOW_POWER_S0 (1<<21) /* 21: [V5] S0 power savings are equal or better than S3 (ACPI 5.0) */ +/******************************************************************************* + * + * MEMRESERVE - Dummy entry for memory reserve configuration table + * + ******************************************************************************/ + +struct acpi_table_memreserve { + int size; /* allocated size of the array */ + int count; /* number of entries used */ + u64 next; /* pa of next struct instance */ + struct { + u64 base; + u64 size; + } entry[]; +}; + + /* Values for preferred_profile (Preferred Power Management Profiles) */ enum acpi_prefered_pm_profiles { diff --git a/xen/include/efi/efiapi.h b/xen/include/efi/efiapi.h index a616d1238a..a70b3d8a12 100644 --- a/xen/include/efi/efiapi.h +++ b/xen/include/efi/efiapi.h @@ -882,6 +882,8 @@ typedef struct _EFI_BOOT_SERVICES { #define SAL_SYSTEM_TABLE_GUID \ { 0xeb9d2d32, 0x2d88, 0x11d3, {0x9a, 0x16, 0x0, 0x90, 0x27, 0x3f, 0xc1, 0x4d} } +#define LINUX_EFI_MEMRESERVE_TABLE_GUID \ + { 0x888eb0c6, 0x8ede, 0x4ff5, {0xa8, 0xf0, 0x9a, 0xee, 0x5c, 0xb9, 0x77, 0xc2} } typedef struct _EFI_CONFIGURATION_TABLE { EFI_GUID VendorGuid;
On Arm64, when boot Dom0 with the Linux kernel, it reports the warning: [ 0.403737] ------------[ cut here ]------------ [ 0.403738] WARNING: CPU: 30 PID: 0 at drivers/irqchip/irq-gic-v3-its.c:3074 its_cpu_init+0x814/0xae0 [ 0.403745] Modules linked in: [ 0.403748] CPU: 30 PID: 0 Comm: swapper/30 Tainted: G W 5.15.23-ampere-lts-standard #1 [ 0.403752] pstate: 600001c5 (nZCv dAIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 0.403755] pc : its_cpu_init+0x814/0xae0 [ 0.403758] lr : its_cpu_init+0x810/0xae0 [ 0.403761] sp : ffff800009c03ce0 [ 0.403762] x29: ffff800009c03ce0 x28: 000000000000001e x27: ffff880711f43000 [ 0.403767] x26: ffff80000a3c0070 x25: fffffc1ffe0a4400 x24: ffff80000a3c0000 [ 0.403770] x23: ffff8000095bc998 x22: ffff8000090a6000 x21: ffff800009850cb0 [ 0.403774] x20: ffff800009701a10 x19: ffff800009701000 x18: ffffffffffffffff [ 0.403777] x17: 3030303035303031 x16: 3030313030303078 x15: 303a30206e6f6967 [ 0.403780] x14: 6572206530312072 x13: 3030303030353030 x12: 3130303130303030 [ 0.403784] x11: 78303a30206e6f69 x10: 6765722065303120 x9 : ffff80000870e710 [ 0.403788] x8 : 6964657220646e75 x7 : 0000000000000003 x6 : 0000000000000000 [ 0.403791] x5 : 0000000000000000 x4 : fffffc0000000000 x3 : 0000000000000010 [ 0.403794] x2 : 000000000000ffff x1 : 0000000000010000 x0 : 00000000ffffffed [ 0.403798] Call trace: [ 0.403799] its_cpu_init+0x814/0xae0 [ 0.403802] gic_starting_cpu+0x48/0x90 [ 0.403805] cpuhp_invoke_callback+0x16c/0x5b0 [ 0.403808] cpuhp_invoke_callback_range+0x78/0xf0 [ 0.403811] notify_cpu_starting+0xbc/0xdc [ 0.403814] secondary_start_kernel+0xe0/0x170 [ 0.403817] __secondary_switched+0x94/0x98 [ 0.403821] ---[ end trace f68728a0d3053b70 ]--- In the Linux kernel, the GIC driver tries to reserve ITS interrupt table, and the reserved pages can survive for kexec/kdump and will be used for secondary kernel. Linux kernel relies on MEMRESERVE EFI configuration table for memory page , but this table is not supported by Xen. This patch adds a MEMRESERVE configuration table into the system table, it points to a dummy data structure acpi_table_memreserve, this structure has the consistent definition with the Linux kernel. Following the method in Xen, it creates a table entry for the structure acpi_table_memreserve. Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com> Cc: Rahul Singh <Rahul.Singh@arm.com> Cc: Peter Griffin <peter.griffin@linaro.org> Signed-off-by: Leo Yan <leo.yan@linaro.org> --- xen/arch/arm/acpi/domain_build.c | 24 ++++++++++++++++++++++++ xen/arch/arm/efi/efi-dom0.c | 19 ++++++++++++++++--- xen/arch/arm/include/asm/acpi.h | 1 + xen/include/acpi/actbl.h | 17 +++++++++++++++++ xen/include/efi/efiapi.h | 2 ++ 5 files changed, 60 insertions(+), 3 deletions(-)