diff mbox series

xen/efi: Fix Grub2 boot on arm64

Message ID 20211102140511.5542-1-luca.fancellu@arm.com (mailing list archive)
State Superseded
Headers show
Series xen/efi: Fix Grub2 boot on arm64 | expand

Commit Message

Luca Fancellu Nov. 2, 2021, 2:05 p.m. UTC
The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
("arm/efi: Use dom0less configuration when using EFI boot") is
introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.

The problem comes from the function get_parent_handle(...) that inside
uses the HandleProtocol on loaded_image->DeviceHandle, but the last
is NULL, making Xen stop the UEFI boot.

Before the commit above, the function was never called because the
logic was skipping the call when there were multiboot modules in the
DT because the filesystem was never used and the bootloader had
put in place all the right modules in memory and the addresses
in the DT.

To fix the problem we allow the get_parent_handle(...) function to
return a NULL handle on error and we check the usage of the function
to handle the new use case. The function in fact should not prevent
the boot even if the filesystem can't be used, because the DT and
the modules could be put in place by the bootloader before running
Xen and if xen,uefi-binary property is not used, there is no need
for the filesystem.

Another problem is found when the UEFI stub tries to check if Dom0
image or DomUs are present.
The logic doesn't work when the UEFI stub is not responsible to load
any modules, so the efi_check_dt_boot(...) return value is modified
to return the number of multiboot module found and not only the number
of module loaded by the stub.

Fixes: a1743fc3a9 ("arm/efi: Use dom0less configuration when using EFI boot")
Signed-off-by: Luca Fancellu <luca.fancellu@arm.com>
---
Justification for integration in 4.16:
Upside: allow booting xen from grub on arm64 when the stub doesn't load
        any module.
Downside: It's affecting the EFI boot path.
Risk: It's not affecting x86 arch that works the same way as before.
      If something is wrong it creates a problem on early boot and not at
      runtime, so risk is low.

Tested in this configurations:
 - Bootloader loads modules and specify them as multiboot modules in DT:
   * combination of Dom0, DomUs, Dom0 and DomUs
 - DT specifies multiboot modules in DT using xen,uefi-binary property:
   * combination of Dom0, DomUs, Dom0 and DomUs
 - Bootloader loads a Dom0 module and appends it as multiboot module in DT,
   other multiboot modules are listed for DomUs using xen,uefi-binary
 - No multiboot modules in DT and no kernel entry in cfg file:
   * proper error thrown

---
 xen/arch/arm/efi/efi-boot.h | 28 ++++++++++++++++++----------
 xen/common/efi/boot.c       | 15 ++++++++++++++-
 2 files changed, 32 insertions(+), 11 deletions(-)

Comments

Jan Beulich Nov. 2, 2021, 2:45 p.m. UTC | #1
On 02.11.2021 15:05, Luca Fancellu wrote:
> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
> ("arm/efi: Use dom0less configuration when using EFI boot") is
> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
> 
> The problem comes from the function get_parent_handle(...) that inside
> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
> is NULL, making Xen stop the UEFI boot.

According to my reading the UEFI spec doesn't (explicitly) allow for
this to be NULL. Could you clarify why this is the case? What other
information may end up being invalid / absent? Is e.g. read_section()
safe to use?

> --- a/xen/common/efi/boot.c
> +++ b/xen/common/efi/boot.c
> @@ -449,6 +449,13 @@ static EFI_FILE_HANDLE __init get_parent_handle(EFI_LOADED_IMAGE *loaded_image,
>      CHAR16 *pathend, *ptr;
>      EFI_STATUS ret;
>  
> +    /*
> +     * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
> +     * to have access to the filesystem.
> +     */
> +    if ( !loaded_image->DeviceHandle )
> +        return NULL;

I couldn't find anything in the spec saying that NULL (a pointer with
the numeric value zero) could actually not be a valid handle. Could
you point me to text saying so?

> @@ -581,6 +588,8 @@ static bool __init read_file(EFI_FILE_HANDLE dir_handle, CHAR16 *name,
>      EFI_STATUS ret;
>      const CHAR16 *what = NULL;
>  
> +    if ( !dir_handle )
> +        blexit(L"Error: No access to the filesystem");

dir_handle also gets passed to efi_arch_cfg_file_{early,late}() -
those don't need any adjustment only because they merely pass the
parameter on to read_file()?

> @@ -1333,6 +1342,9 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
>              EFI_FILE_HANDLE handle = get_parent_handle(loaded_image,
>                                                         &file_name);
>  
> +            if ( !handle )
> +                blexit(L"Error retrieving image name: no filesystem access");

I don't think this should be fatal - see the comment ahead of the
enclosing if().

Jan
Luca Fancellu Nov. 2, 2021, 5:12 p.m. UTC | #2
+ Ian Jackson for 4.16 release

> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
> 
> On 02.11.2021 15:05, Luca Fancellu wrote:
>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>> 
>> The problem comes from the function get_parent_handle(...) that inside
>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>> is NULL, making Xen stop the UEFI boot.
> 
> According to my reading the UEFI spec doesn't (explicitly) allow for
> this to be NULL. Could you clarify why this is the case? What other
> information may end up being invalid / absent? Is e.g. read_section()
> safe to use?

My test on an arm machine running Grub2 on top of EDK2 showed that
when Xen is started, the get_parent_handle(…) call was failing and stopping
the boot because the efi_bs->HandleProtocol(…) was called with the
loaded_image->DeviceHandle argument NULL and the call was returning
a EFI_INVALID_PARAMETER.
So the parent handle can’t be requested and the filesystem can’t be used,
but any other code that doesn’t use the handle provided by get_parent_handle(…)
can be used without problem like read_section(...).

> 
>> --- a/xen/common/efi/boot.c
>> +++ b/xen/common/efi/boot.c
>> @@ -449,6 +449,13 @@ static EFI_FILE_HANDLE __init get_parent_handle(EFI_LOADED_IMAGE *loaded_image,
>>     CHAR16 *pathend, *ptr;
>>     EFI_STATUS ret;
>> 
>> +    /*
>> +     * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>> +     * to have access to the filesystem.
>> +     */
>> +    if ( !loaded_image->DeviceHandle )
>> +        return NULL;
> 
> I couldn't find anything in the spec saying that NULL (a pointer with
> the numeric value zero) could actually not be a valid handle. Could
> you point me to text saying so?

I am reading UEFI spec 2.8 A, section 7.3 Protocol Handler Services, when it talks about
EFI_BOOT_SERVICES.HandleProtocol() there is a table of “Status Code Returned” listing
the EFI_INVALID_PARAMETER when the Handle is NULL.

> 
>> @@ -581,6 +588,8 @@ static bool __init read_file(EFI_FILE_HANDLE dir_handle, CHAR16 *name,
>>     EFI_STATUS ret;
>>     const CHAR16 *what = NULL;
>> 
>> +    if ( !dir_handle )
>> +        blexit(L"Error: No access to the filesystem");
> 
> dir_handle also gets passed to efi_arch_cfg_file_{early,late}() -
> those don't need any adjustment only because they merely pass the
> parameter on to read_file()?

Yes, the handling is done in read_file(...)

> 
>> @@ -1333,6 +1342,9 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
>>             EFI_FILE_HANDLE handle = get_parent_handle(loaded_image,
>>                                                        &file_name);
>> 
>> +            if ( !handle )
>> +                blexit(L"Error retrieving image name: no filesystem access");
> 
> I don't think this should be fatal - see the comment ahead of the
> enclosing if().

I’m not sure I get it, I put the fatal condition in part because the handle was dereferenced by
handle->Close(handle), but also because file_name would have not being modified by the call
and we have then *argv = file_name.

Cheers,
Luca

> 
> Jan
>
Stefano Stabellini Nov. 2, 2021, 11:17 p.m. UTC | #3
On Tue, 2 Nov 2021, Luca Fancellu wrote:
> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
> ("arm/efi: Use dom0less configuration when using EFI boot") is
> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
> 
> The problem comes from the function get_parent_handle(...) that inside
> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
> is NULL, making Xen stop the UEFI boot.
> 
> Before the commit above, the function was never called because the
> logic was skipping the call when there were multiboot modules in the
> DT because the filesystem was never used and the bootloader had
> put in place all the right modules in memory and the addresses
> in the DT.
> 
> To fix the problem we allow the get_parent_handle(...) function to
> return a NULL handle on error and we check the usage of the function
> to handle the new use case. The function in fact should not prevent
> the boot even if the filesystem can't be used, because the DT and
> the modules could be put in place by the bootloader before running
> Xen and if xen,uefi-binary property is not used, there is no need
> for the filesystem.
> 
> Another problem is found when the UEFI stub tries to check if Dom0
> image or DomUs are present.
> The logic doesn't work when the UEFI stub is not responsible to load
> any modules, so the efi_check_dt_boot(...) return value is modified
> to return the number of multiboot module found and not only the number
> of module loaded by the stub.
> 
> Fixes: a1743fc3a9 ("arm/efi: Use dom0less configuration when using EFI boot")
> Signed-off-by: Luca Fancellu <luca.fancellu@arm.com>
> ---
> Justification for integration in 4.16:
> Upside: allow booting xen from grub on arm64 when the stub doesn't load
>         any module.
> Downside: It's affecting the EFI boot path.
> Risk: It's not affecting x86 arch that works the same way as before.
>       If something is wrong it creates a problem on early boot and not at
>       runtime, so risk is low.
> 
> Tested in this configurations:
>  - Bootloader loads modules and specify them as multiboot modules in DT:
>    * combination of Dom0, DomUs, Dom0 and DomUs
>  - DT specifies multiboot modules in DT using xen,uefi-binary property:
>    * combination of Dom0, DomUs, Dom0 and DomUs
>  - Bootloader loads a Dom0 module and appends it as multiboot module in DT,
>    other multiboot modules are listed for DomUs using xen,uefi-binary
>  - No multiboot modules in DT and no kernel entry in cfg file:
>    * proper error thrown
> 
> ---
>  xen/arch/arm/efi/efi-boot.h | 28 ++++++++++++++++++----------
>  xen/common/efi/boot.c       | 15 ++++++++++++++-
>  2 files changed, 32 insertions(+), 11 deletions(-)
> 
> diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
> index 8b88dd26a5..e714b2b44c 100644
> --- a/xen/arch/arm/efi/efi-boot.h
> +++ b/xen/arch/arm/efi/efi-boot.h
> @@ -51,9 +51,11 @@ static int handle_module_node(EFI_FILE_HANDLE dir_handle,
>                                int module_node_offset,
>                                int reg_addr_cells,
>                                int reg_size_cells,
> -                              bool is_domu_module);
> +                              bool is_domu_module,
> +                              unsigned int *modules_found);
>  static int handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
> -                                       int domain_node);
> +                                       int domain_node,
> +                                       unsigned int *modules_found);
>  static int efi_check_dt_boot(EFI_FILE_HANDLE dir_handle);
>  
>  #define DEVICE_TREE_GUID \
> @@ -707,7 +709,8 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
>                                       int module_node_offset,
>                                       int reg_addr_cells,
>                                       int reg_size_cells,
> -                                     bool is_domu_module)
> +                                     bool is_domu_module,
> +                                     unsigned int *modules_found)
>  {
>      const void *uefi_name_prop;
>      char mod_string[24]; /* Placeholder for module@ + a 64-bit number + \0 */
> @@ -725,6 +728,9 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
>          /* Module is not a multiboot,module */
>          return 0;
>  
> +    /* Count the multiboot module as found */
> +    (*modules_found)++;
> +
>      /* Read xen,uefi-binary property to get the file name. */
>      uefi_name_prop = fdt_getprop(fdt, module_node_offset, "xen,uefi-binary",
>                                   &uefi_name_len);
> @@ -804,7 +810,8 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
>   * Returns 0 on success, negative number on error.
>   */
>  static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
> -                                              int domain_node)
> +                                              int domain_node,
> +                                              unsigned int *modules_found)
>  {
>      int module_node, addr_cells, size_cells, len;
>      const struct fdt_property *prop;
> @@ -834,7 +841,7 @@ static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
>            module_node = fdt_next_subnode(fdt, module_node) )
>      {
>          int ret = handle_module_node(dir_handle, module_node, addr_cells,
> -                                     size_cells, true);
> +                                     size_cells, true, modules_found);
>          if ( ret < 0 )
>              return ret;
>      }
> @@ -845,12 +852,12 @@ static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
>  /*
>   * This function checks for xen domain nodes under the /chosen node for possible
>   * dom0 and domU guests to be loaded.
> - * Returns the number of modules loaded or a negative number for error.
> + * Returns the number of multiboot modules found or a negative number for error.
>   */
>  static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
>  {
>      int chosen, node, addr_len, size_len;
> -    unsigned int i = 0;
> +    unsigned int i = 0, modules_found = 0;
>  
>      /* Check for the chosen node in the current DTB */
>      chosen = setup_chosen_node(fdt, &addr_len, &size_len);
> @@ -868,11 +875,12 @@ static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
>          if ( !fdt_node_check_compatible(fdt, node, "xen,domain") )
>          {
>              /* Found a node with compatible xen,domain; handle this node. */
> -            if ( handle_dom0less_domain_node(dir_handle, node) < 0 )
> +            if ( handle_dom0less_domain_node(dir_handle, node,
> +                                             &modules_found) < 0 )
>                  return ERROR_DT_MODULE_DOMU;
>          }
>          else if ( handle_module_node(dir_handle, node, addr_len, size_len,
> -                                     false) < 0 )
> +                                     false, &modules_found) < 0 )
>                   return ERROR_DT_MODULE_DOM0;

I think there is no need to add modules_found to the parameters of
handle_dom0less_domain_node and handle_module_node. You could just
increment modules_found here for every iteration of the loop where
there is no error.  You would have to change a couple of returns in
handle_module_node, just to give you the idea:


diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
index e714b2b44c..7739789c41 100644
--- a/xen/arch/arm/efi/efi-boot.h
+++ b/xen/arch/arm/efi/efi-boot.h
@@ -726,7 +726,7 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
 
     if ( module_compat != 0 )
         /* Module is not a multiboot,module */
-        return 0;
+        return 1;
 
     /* Count the multiboot module as found */
     (*modules_found)++;
@@ -737,7 +737,7 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
 
     if ( !uefi_name_prop )
         /* Property not found */
-        return 0;
+        return 1;
 
     file_idx = get_module_file_index(uefi_name_prop, uefi_name_len);
     if ( file_idx < 0 )


>      }
>  
> @@ -883,7 +891,7 @@ static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
>          efi_bs->FreePool(modules[i].name);
>      }
>  
> -    return modules_idx;
> +    return modules_found;
>  }
>  
>  static void __init efi_arch_cpu(void)
> diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c
> index 392ff3ac9b..495e7a4096 100644
> --- a/xen/common/efi/boot.c
> +++ b/xen/common/efi/boot.c
> @@ -449,6 +449,13 @@ static EFI_FILE_HANDLE __init get_parent_handle(EFI_LOADED_IMAGE *loaded_image,
>      CHAR16 *pathend, *ptr;
>      EFI_STATUS ret;
>  
> +    /*
> +     * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
> +     * to have access to the filesystem.
> +     */
> +    if ( !loaded_image->DeviceHandle )
> +        return NULL;
> +
>      do {
>          EFI_FILE_IO_INTERFACE *fio;
>  
> @@ -581,6 +588,8 @@ static bool __init read_file(EFI_FILE_HANDLE dir_handle, CHAR16 *name,
>      EFI_STATUS ret;
>      const CHAR16 *what = NULL;
>  
> +    if ( !dir_handle )
> +        blexit(L"Error: No access to the filesystem");
>      if ( !name )
>          PrintErrMesg(L"No filename", EFI_OUT_OF_RESOURCES);
>      ret = dir_handle->Open(dir_handle, &FileHandle, name,
> @@ -1333,6 +1342,9 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
>              EFI_FILE_HANDLE handle = get_parent_handle(loaded_image,
>                                                         &file_name);
>  
> +            if ( !handle )
> +                blexit(L"Error retrieving image name: no filesystem access");

I think it would be nice to have an other explicit check like this one
at the beginning of if ( use_cfg_file ) to make sure dir_handle is not
null in that case. If I am not mistaken, if we take the use_cfg_file
path, dir_handle has to be valid, right?


>              handle->Close(handle);
>              *argv = file_name;
>          }
> @@ -1369,7 +1381,8 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
>      /* Get the number of boot modules specified on the DT or an error (<0) */
>      dt_modules_found = efi_check_dt_boot(dir_handle);
>  
> -    dir_handle->Close(dir_handle);
> +    if ( dir_handle )
> +        dir_handle->Close(dir_handle);
>  
>      if ( dt_modules_found < 0 )
>          /* efi_check_dt_boot throws some error */
> -- 
> 2.17.1
>
Stefano Stabellini Nov. 2, 2021, 11:17 p.m. UTC | #4
On Tue, 2 Nov 2021, Luca Fancellu wrote:
> + Ian Jackson for 4.16 release
> 
> > On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
> > 
> > On 02.11.2021 15:05, Luca Fancellu wrote:
> >> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
> >> ("arm/efi: Use dom0less configuration when using EFI boot") is
> >> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
> >> 
> >> The problem comes from the function get_parent_handle(...) that inside
> >> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
> >> is NULL, making Xen stop the UEFI boot.
> > 
> > According to my reading the UEFI spec doesn't (explicitly) allow for
> > this to be NULL. Could you clarify why this is the case? What other
> > information may end up being invalid / absent? Is e.g. read_section()
> > safe to use?
> 
> My test on an arm machine running Grub2 on top of EDK2 showed that
> when Xen is started, the get_parent_handle(…) call was failing and stopping
> the boot because the efi_bs->HandleProtocol(…) was called with the
> loaded_image->DeviceHandle argument NULL and the call was returning
> a EFI_INVALID_PARAMETER.
> So the parent handle can’t be requested and the filesystem can’t be used,
> but any other code that doesn’t use the handle provided by get_parent_handle(…)
> can be used without problem like read_section(...).

It could be the case that Grub2 is doing something not entirely
compliant to the spec.


> > 
> >> --- a/xen/common/efi/boot.c
> >> +++ b/xen/common/efi/boot.c
> >> @@ -449,6 +449,13 @@ static EFI_FILE_HANDLE __init get_parent_handle(EFI_LOADED_IMAGE *loaded_image,
> >>     CHAR16 *pathend, *ptr;
> >>     EFI_STATUS ret;
> >> 
> >> +    /*
> >> +     * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
> >> +     * to have access to the filesystem.
> >> +     */
> >> +    if ( !loaded_image->DeviceHandle )
> >> +        return NULL;
> > 
> > I couldn't find anything in the spec saying that NULL (a pointer with
> > the numeric value zero) could actually not be a valid handle. Could
> > you point me to text saying so?
> 
> I am reading UEFI spec 2.8 A, section 7.3 Protocol Handler Services, when it talks about
> EFI_BOOT_SERVICES.HandleProtocol() there is a table of “Status Code Returned” listing
> the EFI_INVALID_PARAMETER when the Handle is NULL.
> 
> > 
> >> @@ -581,6 +588,8 @@ static bool __init read_file(EFI_FILE_HANDLE dir_handle, CHAR16 *name,
> >>     EFI_STATUS ret;
> >>     const CHAR16 *what = NULL;
> >> 
> >> +    if ( !dir_handle )
> >> +        blexit(L"Error: No access to the filesystem");
> > 
> > dir_handle also gets passed to efi_arch_cfg_file_{early,late}() -
> > those don't need any adjustment only because they merely pass the
> > parameter on to read_file()?
> 
> Yes, the handling is done in read_file(...)

But it is not super obvious, that's one I suggested an additional
explicit check in my other email
Jan Beulich Nov. 3, 2021, 8:20 a.m. UTC | #5
On 02.11.2021 18:12, Luca Fancellu wrote:
>> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
>> On 02.11.2021 15:05, Luca Fancellu wrote:
>>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>>>
>>> The problem comes from the function get_parent_handle(...) that inside
>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>> is NULL, making Xen stop the UEFI boot.
>>
>> According to my reading the UEFI spec doesn't (explicitly) allow for
>> this to be NULL. Could you clarify why this is the case? What other
>> information may end up being invalid / absent? Is e.g. read_section()
>> safe to use?
> 
> My test on an arm machine running Grub2 on top of EDK2 showed that
> when Xen is started, the get_parent_handle(…) call was failing and stopping
> the boot because the efi_bs->HandleProtocol(…) was called with the
> loaded_image->DeviceHandle argument NULL and the call was returning
> a EFI_INVALID_PARAMETER.
> So the parent handle can’t be requested and the filesystem can’t be used,
> but any other code that doesn’t use the handle provided by get_parent_handle(…)
> can be used without problem like read_section(...).

I understand this. My question was for the reason of ->DeviceHandle
being NULL. IOW I'm wondering whether we're actually talking about a
firmware or GrUB bug, in which case your change is a workaround for
that rather than (primarily) a fix for the earlier Xen change.

>>> --- a/xen/common/efi/boot.c
>>> +++ b/xen/common/efi/boot.c
>>> @@ -449,6 +449,13 @@ static EFI_FILE_HANDLE __init get_parent_handle(EFI_LOADED_IMAGE *loaded_image,
>>>     CHAR16 *pathend, *ptr;
>>>     EFI_STATUS ret;
>>>
>>> +    /*
>>> +     * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>>> +     * to have access to the filesystem.
>>> +     */
>>> +    if ( !loaded_image->DeviceHandle )
>>> +        return NULL;
>>
>> I couldn't find anything in the spec saying that NULL (a pointer with
>> the numeric value zero) could actually not be a valid handle. Could
>> you point me to text saying so?
> 
> I am reading UEFI spec 2.8 A, section 7.3 Protocol Handler Services, when it talks about
> EFI_BOOT_SERVICES.HandleProtocol() there is a table of “Status Code Returned” listing
> the EFI_INVALID_PARAMETER when the Handle is NULL.

Oh, okay. I guess I didn't search very well.

>>> @@ -1333,6 +1342,9 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
>>>             EFI_FILE_HANDLE handle = get_parent_handle(loaded_image,
>>>                                                        &file_name);
>>>
>>> +            if ( !handle )
>>> +                blexit(L"Error retrieving image name: no filesystem access");
>>
>> I don't think this should be fatal - see the comment ahead of the
>> enclosing if().
> 
> I’m not sure I get it, I put the fatal condition in part because the handle was dereferenced by
> handle->Close(handle), but also because file_name would have not being modified by the call
> and we have then *argv = file_name.

Instead of you making boot fail I was trying to suggest that you insert
a made-up name ("xen" or "xen.efi"?) alongside issuing just a warning
message.

Jan
Luca Fancellu Nov. 3, 2021, 10:20 a.m. UTC | #6
> On 3 Nov 2021, at 08:20, Jan Beulich <jbeulich@suse.com> wrote:
> 
> On 02.11.2021 18:12, Luca Fancellu wrote:
>>> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
>>> On 02.11.2021 15:05, Luca Fancellu wrote:
>>>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>>>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>>>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>>>> 
>>>> The problem comes from the function get_parent_handle(...) that inside
>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>> is NULL, making Xen stop the UEFI boot.
>>> 
>>> According to my reading the UEFI spec doesn't (explicitly) allow for
>>> this to be NULL. Could you clarify why this is the case? What other
>>> information may end up being invalid / absent? Is e.g. read_section()
>>> safe to use?
>> 
>> My test on an arm machine running Grub2 on top of EDK2 showed that
>> when Xen is started, the get_parent_handle(…) call was failing and stopping
>> the boot because the efi_bs->HandleProtocol(…) was called with the
>> loaded_image->DeviceHandle argument NULL and the call was returning
>> a EFI_INVALID_PARAMETER.
>> So the parent handle can’t be requested and the filesystem can’t be used,
>> but any other code that doesn’t use the handle provided by get_parent_handle(…)
>> can be used without problem like read_section(...).
> 
> I understand this. My question was for the reason of ->DeviceHandle
> being NULL. IOW I'm wondering whether we're actually talking about a
> firmware or GrUB bug, in which case your change is a workaround for
> that rather than (primarily) a fix for the earlier Xen change.

The issue was found only when using EDK2+Grub2, no issue when booting
directly from EDK2.
This is a fix for the regression, because without the EFI changes, Grub2 was
booting successfully Xen. Using grub2 to boot Xen on arm is a very common
solution so not supporting this anymore could lead to lots of people having
issues if they update to Xen 4.16.

> 
>>>> --- a/xen/common/efi/boot.c
>>>> +++ b/xen/common/efi/boot.c
>>>> @@ -449,6 +449,13 @@ static EFI_FILE_HANDLE __init get_parent_handle(EFI_LOADED_IMAGE *loaded_image,
>>>>    CHAR16 *pathend, *ptr;
>>>>    EFI_STATUS ret;
>>>> 
>>>> +    /*
>>>> +     * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>>>> +     * to have access to the filesystem.
>>>> +     */
>>>> +    if ( !loaded_image->DeviceHandle )
>>>> +        return NULL;
>>> 
>>> I couldn't find anything in the spec saying that NULL (a pointer with
>>> the numeric value zero) could actually not be a valid handle. Could
>>> you point me to text saying so?
>> 
>> I am reading UEFI spec 2.8 A, section 7.3 Protocol Handler Services, when it talks about
>> EFI_BOOT_SERVICES.HandleProtocol() there is a table of “Status Code Returned” listing
>> the EFI_INVALID_PARAMETER when the Handle is NULL.
> 
> Oh, okay. I guess I didn't search very well.
> 
>>>> @@ -1333,6 +1342,9 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
>>>>            EFI_FILE_HANDLE handle = get_parent_handle(loaded_image,
>>>>                                                       &file_name);
>>>> 
>>>> +            if ( !handle )
>>>> +                blexit(L"Error retrieving image name: no filesystem access");
>>> 
>>> I don't think this should be fatal - see the comment ahead of the
>>> enclosing if().
>> 
>> I’m not sure I get it, I put the fatal condition in part because the handle was dereferenced by
>> handle->Close(handle), but also because file_name would have not being modified by the call
>> and we have then *argv = file_name.
> 
> Instead of you making boot fail I was trying to suggest that you insert
> a made-up name ("xen" or "xen.efi"?) alongside issuing just a warning
> message.

Oh ok now I see, yes I think I can do it

> 
> Jan
>
Luca Fancellu Nov. 3, 2021, 10:29 a.m. UTC | #7
> On 2 Nov 2021, at 23:17, Stefano Stabellini <sstabellini@kernel.org> wrote:
> 
> On Tue, 2 Nov 2021, Luca Fancellu wrote:
>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>> 
>> The problem comes from the function get_parent_handle(...) that inside
>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>> is NULL, making Xen stop the UEFI boot.
>> 
>> Before the commit above, the function was never called because the
>> logic was skipping the call when there were multiboot modules in the
>> DT because the filesystem was never used and the bootloader had
>> put in place all the right modules in memory and the addresses
>> in the DT.
>> 
>> To fix the problem we allow the get_parent_handle(...) function to
>> return a NULL handle on error and we check the usage of the function
>> to handle the new use case. The function in fact should not prevent
>> the boot even if the filesystem can't be used, because the DT and
>> the modules could be put in place by the bootloader before running
>> Xen and if xen,uefi-binary property is not used, there is no need
>> for the filesystem.
>> 
>> Another problem is found when the UEFI stub tries to check if Dom0
>> image or DomUs are present.
>> The logic doesn't work when the UEFI stub is not responsible to load
>> any modules, so the efi_check_dt_boot(...) return value is modified
>> to return the number of multiboot module found and not only the number
>> of module loaded by the stub.
>> 
>> Fixes: a1743fc3a9 ("arm/efi: Use dom0less configuration when using EFI boot")
>> Signed-off-by: Luca Fancellu <luca.fancellu@arm.com>
>> ---
>> Justification for integration in 4.16:
>> Upside: allow booting xen from grub on arm64 when the stub doesn't load
>>        any module.
>> Downside: It's affecting the EFI boot path.
>> Risk: It's not affecting x86 arch that works the same way as before.
>>      If something is wrong it creates a problem on early boot and not at
>>      runtime, so risk is low.
>> 
>> Tested in this configurations:
>> - Bootloader loads modules and specify them as multiboot modules in DT:
>>   * combination of Dom0, DomUs, Dom0 and DomUs
>> - DT specifies multiboot modules in DT using xen,uefi-binary property:
>>   * combination of Dom0, DomUs, Dom0 and DomUs
>> - Bootloader loads a Dom0 module and appends it as multiboot module in DT,
>>   other multiboot modules are listed for DomUs using xen,uefi-binary
>> - No multiboot modules in DT and no kernel entry in cfg file:
>>   * proper error thrown
>> 
>> ---
>> xen/arch/arm/efi/efi-boot.h | 28 ++++++++++++++++++----------
>> xen/common/efi/boot.c       | 15 ++++++++++++++-
>> 2 files changed, 32 insertions(+), 11 deletions(-)
>> 
>> diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
>> index 8b88dd26a5..e714b2b44c 100644
>> --- a/xen/arch/arm/efi/efi-boot.h
>> +++ b/xen/arch/arm/efi/efi-boot.h
>> @@ -51,9 +51,11 @@ static int handle_module_node(EFI_FILE_HANDLE dir_handle,
>>                               int module_node_offset,
>>                               int reg_addr_cells,
>>                               int reg_size_cells,
>> -                              bool is_domu_module);
>> +                              bool is_domu_module,
>> +                              unsigned int *modules_found);
>> static int handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
>> -                                       int domain_node);
>> +                                       int domain_node,
>> +                                       unsigned int *modules_found);
>> static int efi_check_dt_boot(EFI_FILE_HANDLE dir_handle);
>> 
>> #define DEVICE_TREE_GUID \
>> @@ -707,7 +709,8 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
>>                                      int module_node_offset,
>>                                      int reg_addr_cells,
>>                                      int reg_size_cells,
>> -                                     bool is_domu_module)
>> +                                     bool is_domu_module,
>> +                                     unsigned int *modules_found)
>> {
>>     const void *uefi_name_prop;
>>     char mod_string[24]; /* Placeholder for module@ + a 64-bit number + \0 */
>> @@ -725,6 +728,9 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
>>         /* Module is not a multiboot,module */
>>         return 0;
>> 
>> +    /* Count the multiboot module as found */
>> +    (*modules_found)++;
>> +
>>     /* Read xen,uefi-binary property to get the file name. */
>>     uefi_name_prop = fdt_getprop(fdt, module_node_offset, "xen,uefi-binary",
>>                                  &uefi_name_len);
>> @@ -804,7 +810,8 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
>>  * Returns 0 on success, negative number on error.
>>  */
>> static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
>> -                                              int domain_node)
>> +                                              int domain_node,
>> +                                              unsigned int *modules_found)
>> {
>>     int module_node, addr_cells, size_cells, len;
>>     const struct fdt_property *prop;
>> @@ -834,7 +841,7 @@ static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
>>           module_node = fdt_next_subnode(fdt, module_node) )
>>     {
>>         int ret = handle_module_node(dir_handle, module_node, addr_cells,
>> -                                     size_cells, true);
>> +                                     size_cells, true, modules_found);
>>         if ( ret < 0 )
>>             return ret;
>>     }
>> @@ -845,12 +852,12 @@ static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
>> /*
>>  * This function checks for xen domain nodes under the /chosen node for possible
>>  * dom0 and domU guests to be loaded.
>> - * Returns the number of modules loaded or a negative number for error.
>> + * Returns the number of multiboot modules found or a negative number for error.
>>  */
>> static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
>> {
>>     int chosen, node, addr_len, size_len;
>> -    unsigned int i = 0;
>> +    unsigned int i = 0, modules_found = 0;
>> 
>>     /* Check for the chosen node in the current DTB */
>>     chosen = setup_chosen_node(fdt, &addr_len, &size_len);
>> @@ -868,11 +875,12 @@ static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
>>         if ( !fdt_node_check_compatible(fdt, node, "xen,domain") )
>>         {
>>             /* Found a node with compatible xen,domain; handle this node. */
>> -            if ( handle_dom0less_domain_node(dir_handle, node) < 0 )
>> +            if ( handle_dom0less_domain_node(dir_handle, node,
>> +                                             &modules_found) < 0 )
>>                 return ERROR_DT_MODULE_DOMU;
>>         }
>>         else if ( handle_module_node(dir_handle, node, addr_len, size_len,
>> -                                     false) < 0 )
>> +                                     false, &modules_found) < 0 )
>>                  return ERROR_DT_MODULE_DOM0;
> 
> I think there is no need to add modules_found to the parameters of
> handle_dom0less_domain_node and handle_module_node. You could just
> increment modules_found here for every iteration of the loop where
> there is no error.  You would have to change a couple of returns in
> handle_module_node, just to give you the idea:

Yes we could do that but when we handle a xen,domain node we will count
only one module and that defeats the aim to count every multiboot,module.

If we want to continue with your proposal let me know and I will implement it.

> 
> 
> diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
> index e714b2b44c..7739789c41 100644
> --- a/xen/arch/arm/efi/efi-boot.h
> +++ b/xen/arch/arm/efi/efi-boot.h
> @@ -726,7 +726,7 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
> 
>     if ( module_compat != 0 )
>         /* Module is not a multiboot,module */
> -        return 0;
> +        return 1;
> 
>     /* Count the multiboot module as found */
>     (*modules_found)++;
> @@ -737,7 +737,7 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
> 
>     if ( !uefi_name_prop )
>         /* Property not found */
> -        return 0;
> +        return 1;
> 
>     file_idx = get_module_file_index(uefi_name_prop, uefi_name_len);
>     if ( file_idx < 0 )
> 
> 
>>     }
>> 
>> @@ -883,7 +891,7 @@ static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
>>         efi_bs->FreePool(modules[i].name);
>>     }
>> 
>> -    return modules_idx;
>> +    return modules_found;
>> }
>> 
>> static void __init efi_arch_cpu(void)
>> diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c
>> index 392ff3ac9b..495e7a4096 100644
>> --- a/xen/common/efi/boot.c
>> +++ b/xen/common/efi/boot.c
>> @@ -449,6 +449,13 @@ static EFI_FILE_HANDLE __init get_parent_handle(EFI_LOADED_IMAGE *loaded_image,
>>     CHAR16 *pathend, *ptr;
>>     EFI_STATUS ret;
>> 
>> +    /*
>> +     * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>> +     * to have access to the filesystem.
>> +     */
>> +    if ( !loaded_image->DeviceHandle )
>> +        return NULL;
>> +
>>     do {
>>         EFI_FILE_IO_INTERFACE *fio;
>> 
>> @@ -581,6 +588,8 @@ static bool __init read_file(EFI_FILE_HANDLE dir_handle, CHAR16 *name,
>>     EFI_STATUS ret;
>>     const CHAR16 *what = NULL;
>> 
>> +    if ( !dir_handle )
>> +        blexit(L"Error: No access to the filesystem");
>>     if ( !name )
>>         PrintErrMesg(L"No filename", EFI_OUT_OF_RESOURCES);
>>     ret = dir_handle->Open(dir_handle, &FileHandle, name,
>> @@ -1333,6 +1342,9 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
>>             EFI_FILE_HANDLE handle = get_parent_handle(loaded_image,
>>                                                        &file_name);
>> 
>> +            if ( !handle )
>> +                blexit(L"Error retrieving image name: no filesystem access");
> 
> I think it would be nice to have an other explicit check like this one
> at the beginning of if ( use_cfg_file ) to make sure dir_handle is not
> null in that case. If I am not mistaken, if we take the use_cfg_file
> path, dir_handle has to be valid, right?

Dir_handle could be invalid and we would be able to boot successfully when we take everywhere
the path using read_section, for that reason I didn’t stop the boot earlier.
Given Jan suggestion that check could be also modified to be something like “if there is no handle, *argv=“xen.efi” "
so the boot can continue without problem if we don’t need to read anything from the filesystem.

> 
> 
>>             handle->Close(handle);
>>             *argv = file_name;
>>         }
>> @@ -1369,7 +1381,8 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
>>     /* Get the number of boot modules specified on the DT or an error (<0) */
>>     dt_modules_found = efi_check_dt_boot(dir_handle);
>> 
>> -    dir_handle->Close(dir_handle);
>> +    if ( dir_handle )
>> +        dir_handle->Close(dir_handle);
>> 
>>     if ( dt_modules_found < 0 )
>>         /* efi_check_dt_boot throws some error */
>> -- 
>> 2.17.1
Jan Beulich Nov. 3, 2021, 11:28 a.m. UTC | #8
On 03.11.2021 11:20, Luca Fancellu wrote:
> 
> 
>> On 3 Nov 2021, at 08:20, Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 02.11.2021 18:12, Luca Fancellu wrote:
>>>> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 02.11.2021 15:05, Luca Fancellu wrote:
>>>>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>>>>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>>>>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>>>>>
>>>>> The problem comes from the function get_parent_handle(...) that inside
>>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>>> is NULL, making Xen stop the UEFI boot.
>>>>
>>>> According to my reading the UEFI spec doesn't (explicitly) allow for
>>>> this to be NULL. Could you clarify why this is the case? What other
>>>> information may end up being invalid / absent? Is e.g. read_section()
>>>> safe to use?
>>>
>>> My test on an arm machine running Grub2 on top of EDK2 showed that
>>> when Xen is started, the get_parent_handle(…) call was failing and stopping
>>> the boot because the efi_bs->HandleProtocol(…) was called with the
>>> loaded_image->DeviceHandle argument NULL and the call was returning
>>> a EFI_INVALID_PARAMETER.
>>> So the parent handle can’t be requested and the filesystem can’t be used,
>>> but any other code that doesn’t use the handle provided by get_parent_handle(…)
>>> can be used without problem like read_section(...).
>>
>> I understand this. My question was for the reason of ->DeviceHandle
>> being NULL. IOW I'm wondering whether we're actually talking about a
>> firmware or GrUB bug, in which case your change is a workaround for
>> that rather than (primarily) a fix for the earlier Xen change.
> 
> The issue was found only when using EDK2+Grub2, no issue when booting
> directly from EDK2.
> This is a fix for the regression, because without the EFI changes, Grub2 was
> booting successfully Xen. Using grub2 to boot Xen on arm is a very common
> solution so not supporting this anymore could lead to lots of people having
> issues if they update to Xen 4.16.

I'm not objecting to addressing the issue. But the description needs
to make clear where the origin of the problem lies, and afaict that's
not the earlier Xen change. That one merely uncovered what, according
to your reply, might then be a GrUB bug. Unless, as said earlier, I
merely haven't been able to spot provisions in the spec for the field
in question to be NULL.

Jan
Luca Fancellu Nov. 3, 2021, 2:09 p.m. UTC | #9
> On 3 Nov 2021, at 11:28, Jan Beulich <jbeulich@suse.com> wrote:
> 
> On 03.11.2021 11:20, Luca Fancellu wrote:
>> 
>> 
>>> On 3 Nov 2021, at 08:20, Jan Beulich <jbeulich@suse.com> wrote:
>>> 
>>> On 02.11.2021 18:12, Luca Fancellu wrote:
>>>>> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
>>>>> On 02.11.2021 15:05, Luca Fancellu wrote:
>>>>>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>>>>>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>>>>>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>>>>>> 
>>>>>> The problem comes from the function get_parent_handle(...) that inside
>>>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>>>> is NULL, making Xen stop the UEFI boot.
>>>>> 
>>>>> According to my reading the UEFI spec doesn't (explicitly) allow for
>>>>> this to be NULL. Could you clarify why this is the case? What other
>>>>> information may end up being invalid / absent? Is e.g. read_section()
>>>>> safe to use?
>>>> 
>>>> My test on an arm machine running Grub2 on top of EDK2 showed that
>>>> when Xen is started, the get_parent_handle(…) call was failing and stopping
>>>> the boot because the efi_bs->HandleProtocol(…) was called with the
>>>> loaded_image->DeviceHandle argument NULL and the call was returning
>>>> a EFI_INVALID_PARAMETER.
>>>> So the parent handle can’t be requested and the filesystem can’t be used,
>>>> but any other code that doesn’t use the handle provided by get_parent_handle(…)
>>>> can be used without problem like read_section(...).
>>> 
>>> I understand this. My question was for the reason of ->DeviceHandle
>>> being NULL. IOW I'm wondering whether we're actually talking about a
>>> firmware or GrUB bug, in which case your change is a workaround for
>>> that rather than (primarily) a fix for the earlier Xen change.
>> 
>> The issue was found only when using EDK2+Grub2, no issue when booting
>> directly from EDK2.
>> This is a fix for the regression, because without the EFI changes, Grub2 was
>> booting successfully Xen. Using grub2 to boot Xen on arm is a very common
>> solution so not supporting this anymore could lead to lots of people having
>> issues if they update to Xen 4.16.
> 
> I'm not objecting to addressing the issue. But the description needs
> to make clear where the origin of the problem lies, and afaict that's
> not the earlier Xen change. That one merely uncovered what, according
> to your reply, might then be a GrUB bug. Unless, as said earlier, I
> merely haven't been able to spot provisions in the spec for the field
> in question to be NULL.

Maybe I can rephrase to be more specific from:

The problem comes from the function get_parent_handle(...) that inside
uses the HandleProtocol on loaded_image->DeviceHandle, but the last
is NULL, making Xen stop the UEFI boot.

To:

Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle,
that is used by efi_bs->HandleProtocol(…) inside get_parent_handle(…),
causing Xen to stop the boot getting an EFI_INVALID_PARAMETER error.

Do you think it can be ok like this?

Cheers,
Luca

> 
> Jan
Jan Beulich Nov. 3, 2021, 2:30 p.m. UTC | #10
On 03.11.2021 15:09, Luca Fancellu wrote:
>> On 3 Nov 2021, at 11:28, Jan Beulich <jbeulich@suse.com> wrote:
>> On 03.11.2021 11:20, Luca Fancellu wrote:
>>>> On 3 Nov 2021, at 08:20, Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 02.11.2021 18:12, Luca Fancellu wrote:
>>>>>> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>> On 02.11.2021 15:05, Luca Fancellu wrote:
>>>>>>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>>>>>>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>>>>>>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>>>>>>>
>>>>>>> The problem comes from the function get_parent_handle(...) that inside
>>>>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>>>>> is NULL, making Xen stop the UEFI boot.
>>>>>>
>>>>>> According to my reading the UEFI spec doesn't (explicitly) allow for
>>>>>> this to be NULL. Could you clarify why this is the case? What other
>>>>>> information may end up being invalid / absent? Is e.g. read_section()
>>>>>> safe to use?
>>>>>
>>>>> My test on an arm machine running Grub2 on top of EDK2 showed that
>>>>> when Xen is started, the get_parent_handle(…) call was failing and stopping
>>>>> the boot because the efi_bs->HandleProtocol(…) was called with the
>>>>> loaded_image->DeviceHandle argument NULL and the call was returning
>>>>> a EFI_INVALID_PARAMETER.
>>>>> So the parent handle can’t be requested and the filesystem can’t be used,
>>>>> but any other code that doesn’t use the handle provided by get_parent_handle(…)
>>>>> can be used without problem like read_section(...).
>>>>
>>>> I understand this. My question was for the reason of ->DeviceHandle
>>>> being NULL. IOW I'm wondering whether we're actually talking about a
>>>> firmware or GrUB bug, in which case your change is a workaround for
>>>> that rather than (primarily) a fix for the earlier Xen change.
>>>
>>> The issue was found only when using EDK2+Grub2, no issue when booting
>>> directly from EDK2.
>>> This is a fix for the regression, because without the EFI changes, Grub2 was
>>> booting successfully Xen. Using grub2 to boot Xen on arm is a very common
>>> solution so not supporting this anymore could lead to lots of people having
>>> issues if they update to Xen 4.16.
>>
>> I'm not objecting to addressing the issue. But the description needs
>> to make clear where the origin of the problem lies, and afaict that's
>> not the earlier Xen change. That one merely uncovered what, according
>> to your reply, might then be a GrUB bug. Unless, as said earlier, I
>> merely haven't been able to spot provisions in the spec for the field
>> in question to be NULL.
> 
> Maybe I can rephrase to be more specific from:
> 
> The problem comes from the function get_parent_handle(...) that inside
> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
> is NULL, making Xen stop the UEFI boot.
> 
> To:
> 
> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle,
> that is used by efi_bs->HandleProtocol(…) inside get_parent_handle(…),
> causing Xen to stop the boot getting an EFI_INVALID_PARAMETER error.
> 
> Do you think it can be ok like this?

Much better, yes, but I wonder what "returning" refers to. You want to
describe the origin of the NULL handle as precisely as possible. And
considering this turns out as a workaround, in a suitable place you
will also want to add a code comment, such that a later reader won't
decide this is all dead code and can be done in a simpler way.

Jan
Luca Fancellu Nov. 3, 2021, 3:16 p.m. UTC | #11
> On 3 Nov 2021, at 14:30, Jan Beulich <jbeulich@suse.com> wrote:
> 
> On 03.11.2021 15:09, Luca Fancellu wrote:
>>> On 3 Nov 2021, at 11:28, Jan Beulich <jbeulich@suse.com> wrote:
>>> On 03.11.2021 11:20, Luca Fancellu wrote:
>>>>> On 3 Nov 2021, at 08:20, Jan Beulich <jbeulich@suse.com> wrote:
>>>>> On 02.11.2021 18:12, Luca Fancellu wrote:
>>>>>>> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>> On 02.11.2021 15:05, Luca Fancellu wrote:
>>>>>>>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>>>>>>>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>>>>>>>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>>>>>>>> 
>>>>>>>> The problem comes from the function get_parent_handle(...) that inside
>>>>>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>>>>>> is NULL, making Xen stop the UEFI boot.
>>>>>>> 
>>>>>>> According to my reading the UEFI spec doesn't (explicitly) allow for
>>>>>>> this to be NULL. Could you clarify why this is the case? What other
>>>>>>> information may end up being invalid / absent? Is e.g. read_section()
>>>>>>> safe to use?
>>>>>> 
>>>>>> My test on an arm machine running Grub2 on top of EDK2 showed that
>>>>>> when Xen is started, the get_parent_handle(…) call was failing and stopping
>>>>>> the boot because the efi_bs->HandleProtocol(…) was called with the
>>>>>> loaded_image->DeviceHandle argument NULL and the call was returning
>>>>>> a EFI_INVALID_PARAMETER.
>>>>>> So the parent handle can’t be requested and the filesystem can’t be used,
>>>>>> but any other code that doesn’t use the handle provided by get_parent_handle(…)
>>>>>> can be used without problem like read_section(...).
>>>>> 
>>>>> I understand this. My question was for the reason of ->DeviceHandle
>>>>> being NULL. IOW I'm wondering whether we're actually talking about a
>>>>> firmware or GrUB bug, in which case your change is a workaround for
>>>>> that rather than (primarily) a fix for the earlier Xen change.
>>>> 
>>>> The issue was found only when using EDK2+Grub2, no issue when booting
>>>> directly from EDK2.
>>>> This is a fix for the regression, because without the EFI changes, Grub2 was
>>>> booting successfully Xen. Using grub2 to boot Xen on arm is a very common
>>>> solution so not supporting this anymore could lead to lots of people having
>>>> issues if they update to Xen 4.16.
>>> 
>>> I'm not objecting to addressing the issue. But the description needs
>>> to make clear where the origin of the problem lies, and afaict that's
>>> not the earlier Xen change. That one merely uncovered what, according
>>> to your reply, might then be a GrUB bug. Unless, as said earlier, I
>>> merely haven't been able to spot provisions in the spec for the field
>>> in question to be NULL.
>> 
>> Maybe I can rephrase to be more specific from:
>> 
>> The problem comes from the function get_parent_handle(...) that inside
>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>> is NULL, making Xen stop the UEFI boot.
>> 
>> To:
>> 
>> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle,
>> that is used by efi_bs->HandleProtocol(…) inside get_parent_handle(…),
>> causing Xen to stop the boot getting an EFI_INVALID_PARAMETER error.
>> 
>> Do you think it can be ok like this?
> 
> Much better, yes, but I wonder what "returning" refers to. You want to
> describe the origin of the NULL handle as precisely as possible. And
> considering this turns out as a workaround, in a suitable place you
> will also want to add a code comment, such that a later reader won't
> decide this is all dead code and can be done in a simpler way.

Ok I can write the issue from the beginning to be sure:

Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle
inside the interface given by the LOADED_IMAGE_PROTOCOL service, this
handle is used later by efi_bs->HandleProtocol(…) inside get_parent_handle(…)
when requesting the SIMPLE_FILE_SYSTEM_PROTOCOL interface,
causing Xen to stop the boot because of an EFI_INVALID_PARAMETER error.

Regarding the comment, I can rephrase this comment:
/*
 * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
 * to have access to the filesystem.
 */

To be:

/*
 * If DeviceHandle is NULL, the firmware offering the UEFI services might not be
 * compliant to the standard and we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
 * to have access to the filesystem. However the system can boot if and only if it doesn’t
 * require access to the filesystem. (e.g. Xen image has everything built in or the
 * bootloader did previously load every needed binary in memory)
 */

What do you think?

Cheers,
Luca

> 
> Jan
Jan Beulich Nov. 3, 2021, 3:30 p.m. UTC | #12
On 03.11.2021 16:16, Luca Fancellu wrote:
> 
> 
>> On 3 Nov 2021, at 14:30, Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 03.11.2021 15:09, Luca Fancellu wrote:
>>>> On 3 Nov 2021, at 11:28, Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 03.11.2021 11:20, Luca Fancellu wrote:
>>>>>> On 3 Nov 2021, at 08:20, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>> On 02.11.2021 18:12, Luca Fancellu wrote:
>>>>>>>> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>>> On 02.11.2021 15:05, Luca Fancellu wrote:
>>>>>>>>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>>>>>>>>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>>>>>>>>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>>>>>>>>>
>>>>>>>>> The problem comes from the function get_parent_handle(...) that inside
>>>>>>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>>>>>>> is NULL, making Xen stop the UEFI boot.
>>>>>>>>
>>>>>>>> According to my reading the UEFI spec doesn't (explicitly) allow for
>>>>>>>> this to be NULL. Could you clarify why this is the case? What other
>>>>>>>> information may end up being invalid / absent? Is e.g. read_section()
>>>>>>>> safe to use?
>>>>>>>
>>>>>>> My test on an arm machine running Grub2 on top of EDK2 showed that
>>>>>>> when Xen is started, the get_parent_handle(…) call was failing and stopping
>>>>>>> the boot because the efi_bs->HandleProtocol(…) was called with the
>>>>>>> loaded_image->DeviceHandle argument NULL and the call was returning
>>>>>>> a EFI_INVALID_PARAMETER.
>>>>>>> So the parent handle can’t be requested and the filesystem can’t be used,
>>>>>>> but any other code that doesn’t use the handle provided by get_parent_handle(…)
>>>>>>> can be used without problem like read_section(...).
>>>>>>
>>>>>> I understand this. My question was for the reason of ->DeviceHandle
>>>>>> being NULL. IOW I'm wondering whether we're actually talking about a
>>>>>> firmware or GrUB bug, in which case your change is a workaround for
>>>>>> that rather than (primarily) a fix for the earlier Xen change.
>>>>>
>>>>> The issue was found only when using EDK2+Grub2, no issue when booting
>>>>> directly from EDK2.
>>>>> This is a fix for the regression, because without the EFI changes, Grub2 was
>>>>> booting successfully Xen. Using grub2 to boot Xen on arm is a very common
>>>>> solution so not supporting this anymore could lead to lots of people having
>>>>> issues if they update to Xen 4.16.
>>>>
>>>> I'm not objecting to addressing the issue. But the description needs
>>>> to make clear where the origin of the problem lies, and afaict that's
>>>> not the earlier Xen change. That one merely uncovered what, according
>>>> to your reply, might then be a GrUB bug. Unless, as said earlier, I
>>>> merely haven't been able to spot provisions in the spec for the field
>>>> in question to be NULL.
>>>
>>> Maybe I can rephrase to be more specific from:
>>>
>>> The problem comes from the function get_parent_handle(...) that inside
>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>> is NULL, making Xen stop the UEFI boot.
>>>
>>> To:
>>>
>>> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle,
>>> that is used by efi_bs->HandleProtocol(…) inside get_parent_handle(…),
>>> causing Xen to stop the boot getting an EFI_INVALID_PARAMETER error.
>>>
>>> Do you think it can be ok like this?
>>
>> Much better, yes, but I wonder what "returning" refers to. You want to
>> describe the origin of the NULL handle as precisely as possible. And
>> considering this turns out as a workaround, in a suitable place you
>> will also want to add a code comment, such that a later reader won't
>> decide this is all dead code and can be done in a simpler way.
> 
> Ok I can write the issue from the beginning to be sure:
> 
> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle
> inside the interface given by the LOADED_IMAGE_PROTOCOL service, this
> handle is used later by efi_bs->HandleProtocol(…) inside get_parent_handle(…)
> when requesting the SIMPLE_FILE_SYSTEM_PROTOCOL interface,
> causing Xen to stop the boot because of an EFI_INVALID_PARAMETER error.
> 
> Regarding the comment, I can rephrase this comment:
> /*
>  * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>  * to have access to the filesystem.
>  */
> 
> To be:
> 
> /*
>  * If DeviceHandle is NULL, the firmware offering the UEFI services might not be
>  * compliant to the standard and we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>  * to have access to the filesystem. However the system can boot if and only if it doesn’t
>  * require access to the filesystem. (e.g. Xen image has everything built in or the
>  * bootloader did previously load every needed binary in memory)
>  */
> 
> What do you think?

Largely okay, albeit you don't mention GrUB at all (which isn't really part
of the firmware, but which looks to be the culprit) and it gets a little
too verbose. Provided the facts have been verified, how about

    /*
     * GrUB has been observed to supply a NULL DeviceHandle. We can't use
     * that to gain access to the filesystem. However the system can still
     * boot if it doesn’t require access to the filesystem.
     */

(and it's up to you whether you include your further "e.g. ..." then, but
if you do I think the parenthesized part belong before the final full
stop, and the last "in" would want to be "into")? It's still dubious to me
how they can get away with such a NULL handle (and why that happens only
on Arm), but I guess it would go too far to try to understand the
background.

Jan
Bertrand Marquis Nov. 3, 2021, 3:41 p.m. UTC | #13
Hi Jan,

> On 3 Nov 2021, at 15:30, Jan Beulich <jbeulich@suse.com> wrote:
> 
> On 03.11.2021 16:16, Luca Fancellu wrote:
>> 
>> 
>>> On 3 Nov 2021, at 14:30, Jan Beulich <jbeulich@suse.com> wrote:
>>> 
>>> On 03.11.2021 15:09, Luca Fancellu wrote:
>>>>> On 3 Nov 2021, at 11:28, Jan Beulich <jbeulich@suse.com> wrote:
>>>>> On 03.11.2021 11:20, Luca Fancellu wrote:
>>>>>>> On 3 Nov 2021, at 08:20, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>> On 02.11.2021 18:12, Luca Fancellu wrote:
>>>>>>>>> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>>>> On 02.11.2021 15:05, Luca Fancellu wrote:
>>>>>>>>>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>>>>>>>>>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>>>>>>>>>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>>>>>>>>>> 
>>>>>>>>>> The problem comes from the function get_parent_handle(...) that inside
>>>>>>>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>>>>>>>> is NULL, making Xen stop the UEFI boot.
>>>>>>>>> 
>>>>>>>>> According to my reading the UEFI spec doesn't (explicitly) allow for
>>>>>>>>> this to be NULL. Could you clarify why this is the case? What other
>>>>>>>>> information may end up being invalid / absent? Is e.g. read_section()
>>>>>>>>> safe to use?
>>>>>>>> 
>>>>>>>> My test on an arm machine running Grub2 on top of EDK2 showed that
>>>>>>>> when Xen is started, the get_parent_handle(…) call was failing and stopping
>>>>>>>> the boot because the efi_bs->HandleProtocol(…) was called with the
>>>>>>>> loaded_image->DeviceHandle argument NULL and the call was returning
>>>>>>>> a EFI_INVALID_PARAMETER.
>>>>>>>> So the parent handle can’t be requested and the filesystem can’t be used,
>>>>>>>> but any other code that doesn’t use the handle provided by get_parent_handle(…)
>>>>>>>> can be used without problem like read_section(...).
>>>>>>> 
>>>>>>> I understand this. My question was for the reason of ->DeviceHandle
>>>>>>> being NULL. IOW I'm wondering whether we're actually talking about a
>>>>>>> firmware or GrUB bug, in which case your change is a workaround for
>>>>>>> that rather than (primarily) a fix for the earlier Xen change.
>>>>>> 
>>>>>> The issue was found only when using EDK2+Grub2, no issue when booting
>>>>>> directly from EDK2.
>>>>>> This is a fix for the regression, because without the EFI changes, Grub2 was
>>>>>> booting successfully Xen. Using grub2 to boot Xen on arm is a very common
>>>>>> solution so not supporting this anymore could lead to lots of people having
>>>>>> issues if they update to Xen 4.16.
>>>>> 
>>>>> I'm not objecting to addressing the issue. But the description needs
>>>>> to make clear where the origin of the problem lies, and afaict that's
>>>>> not the earlier Xen change. That one merely uncovered what, according
>>>>> to your reply, might then be a GrUB bug. Unless, as said earlier, I
>>>>> merely haven't been able to spot provisions in the spec for the field
>>>>> in question to be NULL.
>>>> 
>>>> Maybe I can rephrase to be more specific from:
>>>> 
>>>> The problem comes from the function get_parent_handle(...) that inside
>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>> is NULL, making Xen stop the UEFI boot.
>>>> 
>>>> To:
>>>> 
>>>> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle,
>>>> that is used by efi_bs->HandleProtocol(…) inside get_parent_handle(…),
>>>> causing Xen to stop the boot getting an EFI_INVALID_PARAMETER error.
>>>> 
>>>> Do you think it can be ok like this?
>>> 
>>> Much better, yes, but I wonder what "returning" refers to. You want to
>>> describe the origin of the NULL handle as precisely as possible. And
>>> considering this turns out as a workaround, in a suitable place you
>>> will also want to add a code comment, such that a later reader won't
>>> decide this is all dead code and can be done in a simpler way.
>> 
>> Ok I can write the issue from the beginning to be sure:
>> 
>> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle
>> inside the interface given by the LOADED_IMAGE_PROTOCOL service, this
>> handle is used later by efi_bs->HandleProtocol(…) inside get_parent_handle(…)
>> when requesting the SIMPLE_FILE_SYSTEM_PROTOCOL interface,
>> causing Xen to stop the boot because of an EFI_INVALID_PARAMETER error.
>> 
>> Regarding the comment, I can rephrase this comment:
>> /*
>> * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>> * to have access to the filesystem.
>> */
>> 
>> To be:
>> 
>> /*
>> * If DeviceHandle is NULL, the firmware offering the UEFI services might not be
>> * compliant to the standard and we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>> * to have access to the filesystem. However the system can boot if and only if it doesn’t
>> * require access to the filesystem. (e.g. Xen image has everything built in or the
>> * bootloader did previously load every needed binary in memory)
>> */
>> 
>> What do you think?
> 
> Largely okay, albeit you don't mention GrUB at all (which isn't really part
> of the firmware, but which looks to be the culprit) and it gets a little
> too verbose. Provided the facts have been verified, how about
> 
>    /*
>     * GrUB has been observed to supply a NULL DeviceHandle. We can't use
>     * that to gain access to the filesystem. However the system can still
>     * boot if it doesn’t require access to the filesystem.
>     */
> 
> (and it's up to you whether you include your further "e.g. ..." then, but
> if you do I think the parenthesized part belong before the final full
> stop, and the last "in" would want to be "into")? It's still dubious to me
> how they can get away with such a NULL handle (and why that happens only
> on Arm), but I guess it would go too far to try to understand the
> background.

This might not be a problem in Grub but actually in EDK2 or due to the fact that
EDK2 is starting Grub which is starting Xen. Grub is not modifying this explicitly
from what we found during our investigations.

Cheers
Bertrand

> 
> Jan
Luca Fancellu Nov. 3, 2021, 3:42 p.m. UTC | #14
> On 3 Nov 2021, at 15:30, Jan Beulich <jbeulich@suse.com> wrote:
> 
> On 03.11.2021 16:16, Luca Fancellu wrote:
>> 
>> 
>>> On 3 Nov 2021, at 14:30, Jan Beulich <jbeulich@suse.com> wrote:
>>> 
>>> On 03.11.2021 15:09, Luca Fancellu wrote:
>>>>> On 3 Nov 2021, at 11:28, Jan Beulich <jbeulich@suse.com> wrote:
>>>>> On 03.11.2021 11:20, Luca Fancellu wrote:
>>>>>>> On 3 Nov 2021, at 08:20, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>> On 02.11.2021 18:12, Luca Fancellu wrote:
>>>>>>>>> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>>>> On 02.11.2021 15:05, Luca Fancellu wrote:
>>>>>>>>>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>>>>>>>>>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>>>>>>>>>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>>>>>>>>>> 
>>>>>>>>>> The problem comes from the function get_parent_handle(...) that inside
>>>>>>>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>>>>>>>> is NULL, making Xen stop the UEFI boot.
>>>>>>>>> 
>>>>>>>>> According to my reading the UEFI spec doesn't (explicitly) allow for
>>>>>>>>> this to be NULL. Could you clarify why this is the case? What other
>>>>>>>>> information may end up being invalid / absent? Is e.g. read_section()
>>>>>>>>> safe to use?
>>>>>>>> 
>>>>>>>> My test on an arm machine running Grub2 on top of EDK2 showed that
>>>>>>>> when Xen is started, the get_parent_handle(…) call was failing and stopping
>>>>>>>> the boot because the efi_bs->HandleProtocol(…) was called with the
>>>>>>>> loaded_image->DeviceHandle argument NULL and the call was returning
>>>>>>>> a EFI_INVALID_PARAMETER.
>>>>>>>> So the parent handle can’t be requested and the filesystem can’t be used,
>>>>>>>> but any other code that doesn’t use the handle provided by get_parent_handle(…)
>>>>>>>> can be used without problem like read_section(...).
>>>>>>> 
>>>>>>> I understand this. My question was for the reason of ->DeviceHandle
>>>>>>> being NULL. IOW I'm wondering whether we're actually talking about a
>>>>>>> firmware or GrUB bug, in which case your change is a workaround for
>>>>>>> that rather than (primarily) a fix for the earlier Xen change.
>>>>>> 
>>>>>> The issue was found only when using EDK2+Grub2, no issue when booting
>>>>>> directly from EDK2.
>>>>>> This is a fix for the regression, because without the EFI changes, Grub2 was
>>>>>> booting successfully Xen. Using grub2 to boot Xen on arm is a very common
>>>>>> solution so not supporting this anymore could lead to lots of people having
>>>>>> issues if they update to Xen 4.16.
>>>>> 
>>>>> I'm not objecting to addressing the issue. But the description needs
>>>>> to make clear where the origin of the problem lies, and afaict that's
>>>>> not the earlier Xen change. That one merely uncovered what, according
>>>>> to your reply, might then be a GrUB bug. Unless, as said earlier, I
>>>>> merely haven't been able to spot provisions in the spec for the field
>>>>> in question to be NULL.
>>>> 
>>>> Maybe I can rephrase to be more specific from:
>>>> 
>>>> The problem comes from the function get_parent_handle(...) that inside
>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>> is NULL, making Xen stop the UEFI boot.
>>>> 
>>>> To:
>>>> 
>>>> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle,
>>>> that is used by efi_bs->HandleProtocol(…) inside get_parent_handle(…),
>>>> causing Xen to stop the boot getting an EFI_INVALID_PARAMETER error.
>>>> 
>>>> Do you think it can be ok like this?
>>> 
>>> Much better, yes, but I wonder what "returning" refers to. You want to
>>> describe the origin of the NULL handle as precisely as possible. And
>>> considering this turns out as a workaround, in a suitable place you
>>> will also want to add a code comment, such that a later reader won't
>>> decide this is all dead code and can be done in a simpler way.
>> 
>> Ok I can write the issue from the beginning to be sure:
>> 
>> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle
>> inside the interface given by the LOADED_IMAGE_PROTOCOL service, this
>> handle is used later by efi_bs->HandleProtocol(…) inside get_parent_handle(…)
>> when requesting the SIMPLE_FILE_SYSTEM_PROTOCOL interface,
>> causing Xen to stop the boot because of an EFI_INVALID_PARAMETER error.
>> 
>> Regarding the comment, I can rephrase this comment:
>> /*
>> * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>> * to have access to the filesystem.
>> */
>> 
>> To be:
>> 
>> /*
>> * If DeviceHandle is NULL, the firmware offering the UEFI services might not be
>> * compliant to the standard and we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>> * to have access to the filesystem. However the system can boot if and only if it doesn’t
>> * require access to the filesystem. (e.g. Xen image has everything built in or the
>> * bootloader did previously load every needed binary in memory)
>> */
>> 
>> What do you think?
> 
> Largely okay, albeit you don't mention GrUB at all (which isn't really part
> of the firmware, but which looks to be the culprit) and it gets a little
> too verbose. Provided the facts have been verified, how about
> 
>    /*
>     * GrUB has been observed to supply a NULL DeviceHandle. We can't use
>     * that to gain access to the filesystem. However the system can still
>     * boot if it doesn’t require access to the filesystem.
>     */

Ok yes this is better, I will do the changes to the commit message and the
comment for the v2, I’ll wait Stefano reply to see if I have to include also
his suggestion.

Thank you.

> 
> (and it's up to you whether you include your further "e.g. ..." then, but
> if you do I think the parenthesized part belong before the final full
> stop, and the last "in" would want to be "into")? It's still dubious to me
> how they can get away with such a NULL handle (and why that happens only
> on Arm), but I guess it would go too far to try to understand the
> background.
> 
> Jan
Jan Beulich Nov. 3, 2021, 3:51 p.m. UTC | #15
On 03.11.2021 16:41, Bertrand Marquis wrote:
> Hi Jan,
> 
>> On 3 Nov 2021, at 15:30, Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 03.11.2021 16:16, Luca Fancellu wrote:
>>>
>>>
>>>> On 3 Nov 2021, at 14:30, Jan Beulich <jbeulich@suse.com> wrote:
>>>>
>>>> On 03.11.2021 15:09, Luca Fancellu wrote:
>>>>>> On 3 Nov 2021, at 11:28, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>> On 03.11.2021 11:20, Luca Fancellu wrote:
>>>>>>>> On 3 Nov 2021, at 08:20, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>>> On 02.11.2021 18:12, Luca Fancellu wrote:
>>>>>>>>>> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>>>>> On 02.11.2021 15:05, Luca Fancellu wrote:
>>>>>>>>>>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>>>>>>>>>>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>>>>>>>>>>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>>>>>>>>>>>
>>>>>>>>>>> The problem comes from the function get_parent_handle(...) that inside
>>>>>>>>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>>>>>>>>> is NULL, making Xen stop the UEFI boot.
>>>>>>>>>>
>>>>>>>>>> According to my reading the UEFI spec doesn't (explicitly) allow for
>>>>>>>>>> this to be NULL. Could you clarify why this is the case? What other
>>>>>>>>>> information may end up being invalid / absent? Is e.g. read_section()
>>>>>>>>>> safe to use?
>>>>>>>>>
>>>>>>>>> My test on an arm machine running Grub2 on top of EDK2 showed that
>>>>>>>>> when Xen is started, the get_parent_handle(…) call was failing and stopping
>>>>>>>>> the boot because the efi_bs->HandleProtocol(…) was called with the
>>>>>>>>> loaded_image->DeviceHandle argument NULL and the call was returning
>>>>>>>>> a EFI_INVALID_PARAMETER.
>>>>>>>>> So the parent handle can’t be requested and the filesystem can’t be used,
>>>>>>>>> but any other code that doesn’t use the handle provided by get_parent_handle(…)
>>>>>>>>> can be used without problem like read_section(...).
>>>>>>>>
>>>>>>>> I understand this. My question was for the reason of ->DeviceHandle
>>>>>>>> being NULL. IOW I'm wondering whether we're actually talking about a
>>>>>>>> firmware or GrUB bug, in which case your change is a workaround for
>>>>>>>> that rather than (primarily) a fix for the earlier Xen change.
>>>>>>>
>>>>>>> The issue was found only when using EDK2+Grub2, no issue when booting
>>>>>>> directly from EDK2.
>>>>>>> This is a fix for the regression, because without the EFI changes, Grub2 was
>>>>>>> booting successfully Xen. Using grub2 to boot Xen on arm is a very common
>>>>>>> solution so not supporting this anymore could lead to lots of people having
>>>>>>> issues if they update to Xen 4.16.
>>>>>>
>>>>>> I'm not objecting to addressing the issue. But the description needs
>>>>>> to make clear where the origin of the problem lies, and afaict that's
>>>>>> not the earlier Xen change. That one merely uncovered what, according
>>>>>> to your reply, might then be a GrUB bug. Unless, as said earlier, I
>>>>>> merely haven't been able to spot provisions in the spec for the field
>>>>>> in question to be NULL.
>>>>>
>>>>> Maybe I can rephrase to be more specific from:
>>>>>
>>>>> The problem comes from the function get_parent_handle(...) that inside
>>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>>> is NULL, making Xen stop the UEFI boot.
>>>>>
>>>>> To:
>>>>>
>>>>> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle,
>>>>> that is used by efi_bs->HandleProtocol(…) inside get_parent_handle(…),
>>>>> causing Xen to stop the boot getting an EFI_INVALID_PARAMETER error.
>>>>>
>>>>> Do you think it can be ok like this?
>>>>
>>>> Much better, yes, but I wonder what "returning" refers to. You want to
>>>> describe the origin of the NULL handle as precisely as possible. And
>>>> considering this turns out as a workaround, in a suitable place you
>>>> will also want to add a code comment, such that a later reader won't
>>>> decide this is all dead code and can be done in a simpler way.
>>>
>>> Ok I can write the issue from the beginning to be sure:
>>>
>>> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle
>>> inside the interface given by the LOADED_IMAGE_PROTOCOL service, this
>>> handle is used later by efi_bs->HandleProtocol(…) inside get_parent_handle(…)
>>> when requesting the SIMPLE_FILE_SYSTEM_PROTOCOL interface,
>>> causing Xen to stop the boot because of an EFI_INVALID_PARAMETER error.
>>>
>>> Regarding the comment, I can rephrase this comment:
>>> /*
>>> * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>>> * to have access to the filesystem.
>>> */
>>>
>>> To be:
>>>
>>> /*
>>> * If DeviceHandle is NULL, the firmware offering the UEFI services might not be
>>> * compliant to the standard and we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>>> * to have access to the filesystem. However the system can boot if and only if it doesn’t
>>> * require access to the filesystem. (e.g. Xen image has everything built in or the
>>> * bootloader did previously load every needed binary in memory)
>>> */
>>>
>>> What do you think?
>>
>> Largely okay, albeit you don't mention GrUB at all (which isn't really part
>> of the firmware, but which looks to be the culprit) and it gets a little
>> too verbose. Provided the facts have been verified, how about
>>
>>    /*
>>     * GrUB has been observed to supply a NULL DeviceHandle. We can't use
>>     * that to gain access to the filesystem. However the system can still
>>     * boot if it doesn’t require access to the filesystem.
>>     */
>>
>> (and it's up to you whether you include your further "e.g. ..." then, but
>> if you do I think the parenthesized part belong before the final full
>> stop, and the last "in" would want to be "into")? It's still dubious to me
>> how they can get away with such a NULL handle (and why that happens only
>> on Arm), but I guess it would go too far to try to understand the
>> background.
> 
> This might not be a problem in Grub but actually in EDK2 or due to the fact that
> EDK2 is starting Grub which is starting Xen. Grub is not modifying this explicitly
> from what we found during our investigations.

Otoh Luca said that there's no problem without GrUB. So maybe "GrUB
in combination with EDK2 ..."?

Thinking more about it, this may also be partly related to our
limitation to loading files from file systems (and not e.g. networks
or RAM). Yet even then I couldn't see how a NULL device handle could
be used for anything.

Jan
Luca Fancellu Nov. 3, 2021, 4:07 p.m. UTC | #16
> On 3 Nov 2021, at 15:51, Jan Beulich <jbeulich@suse.com> wrote:
> 
> On 03.11.2021 16:41, Bertrand Marquis wrote:
>> Hi Jan,
>> 
>>> On 3 Nov 2021, at 15:30, Jan Beulich <jbeulich@suse.com> wrote:
>>> 
>>> On 03.11.2021 16:16, Luca Fancellu wrote:
>>>> 
>>>> 
>>>>> On 3 Nov 2021, at 14:30, Jan Beulich <jbeulich@suse.com> wrote:
>>>>> 
>>>>> On 03.11.2021 15:09, Luca Fancellu wrote:
>>>>>>> On 3 Nov 2021, at 11:28, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>> On 03.11.2021 11:20, Luca Fancellu wrote:
>>>>>>>>> On 3 Nov 2021, at 08:20, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>>>> On 02.11.2021 18:12, Luca Fancellu wrote:
>>>>>>>>>>> On 2 Nov 2021, at 14:45, Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>>>>>> On 02.11.2021 15:05, Luca Fancellu wrote:
>>>>>>>>>>>> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
>>>>>>>>>>>> ("arm/efi: Use dom0less configuration when using EFI boot") is
>>>>>>>>>>>> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
>>>>>>>>>>>> 
>>>>>>>>>>>> The problem comes from the function get_parent_handle(...) that inside
>>>>>>>>>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>>>>>>>>>> is NULL, making Xen stop the UEFI boot.
>>>>>>>>>>> 
>>>>>>>>>>> According to my reading the UEFI spec doesn't (explicitly) allow for
>>>>>>>>>>> this to be NULL. Could you clarify why this is the case? What other
>>>>>>>>>>> information may end up being invalid / absent? Is e.g. read_section()
>>>>>>>>>>> safe to use?
>>>>>>>>>> 
>>>>>>>>>> My test on an arm machine running Grub2 on top of EDK2 showed that
>>>>>>>>>> when Xen is started, the get_parent_handle(…) call was failing and stopping
>>>>>>>>>> the boot because the efi_bs->HandleProtocol(…) was called with the
>>>>>>>>>> loaded_image->DeviceHandle argument NULL and the call was returning
>>>>>>>>>> a EFI_INVALID_PARAMETER.
>>>>>>>>>> So the parent handle can’t be requested and the filesystem can’t be used,
>>>>>>>>>> but any other code that doesn’t use the handle provided by get_parent_handle(…)
>>>>>>>>>> can be used without problem like read_section(...).
>>>>>>>>> 
>>>>>>>>> I understand this. My question was for the reason of ->DeviceHandle
>>>>>>>>> being NULL. IOW I'm wondering whether we're actually talking about a
>>>>>>>>> firmware or GrUB bug, in which case your change is a workaround for
>>>>>>>>> that rather than (primarily) a fix for the earlier Xen change.
>>>>>>>> 
>>>>>>>> The issue was found only when using EDK2+Grub2, no issue when booting
>>>>>>>> directly from EDK2.
>>>>>>>> This is a fix for the regression, because without the EFI changes, Grub2 was
>>>>>>>> booting successfully Xen. Using grub2 to boot Xen on arm is a very common
>>>>>>>> solution so not supporting this anymore could lead to lots of people having
>>>>>>>> issues if they update to Xen 4.16.
>>>>>>> 
>>>>>>> I'm not objecting to addressing the issue. But the description needs
>>>>>>> to make clear where the origin of the problem lies, and afaict that's
>>>>>>> not the earlier Xen change. That one merely uncovered what, according
>>>>>>> to your reply, might then be a GrUB bug. Unless, as said earlier, I
>>>>>>> merely haven't been able to spot provisions in the spec for the field
>>>>>>> in question to be NULL.
>>>>>> 
>>>>>> Maybe I can rephrase to be more specific from:
>>>>>> 
>>>>>> The problem comes from the function get_parent_handle(...) that inside
>>>>>> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
>>>>>> is NULL, making Xen stop the UEFI boot.
>>>>>> 
>>>>>> To:
>>>>>> 
>>>>>> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle,
>>>>>> that is used by efi_bs->HandleProtocol(…) inside get_parent_handle(…),
>>>>>> causing Xen to stop the boot getting an EFI_INVALID_PARAMETER error.
>>>>>> 
>>>>>> Do you think it can be ok like this?
>>>>> 
>>>>> Much better, yes, but I wonder what "returning" refers to. You want to
>>>>> describe the origin of the NULL handle as precisely as possible. And
>>>>> considering this turns out as a workaround, in a suitable place you
>>>>> will also want to add a code comment, such that a later reader won't
>>>>> decide this is all dead code and can be done in a simpler way.
>>>> 
>>>> Ok I can write the issue from the beginning to be sure:
>>>> 
>>>> Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle
>>>> inside the interface given by the LOADED_IMAGE_PROTOCOL service, this
>>>> handle is used later by efi_bs->HandleProtocol(…) inside get_parent_handle(…)
>>>> when requesting the SIMPLE_FILE_SYSTEM_PROTOCOL interface,
>>>> causing Xen to stop the boot because of an EFI_INVALID_PARAMETER error.
>>>> 
>>>> Regarding the comment, I can rephrase this comment:
>>>> /*
>>>> * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>>>> * to have access to the filesystem.
>>>> */
>>>> 
>>>> To be:
>>>> 
>>>> /*
>>>> * If DeviceHandle is NULL, the firmware offering the UEFI services might not be
>>>> * compliant to the standard and we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
>>>> * to have access to the filesystem. However the system can boot if and only if it doesn’t
>>>> * require access to the filesystem. (e.g. Xen image has everything built in or the
>>>> * bootloader did previously load every needed binary in memory)
>>>> */
>>>> 
>>>> What do you think?
>>> 
>>> Largely okay, albeit you don't mention GrUB at all (which isn't really part
>>> of the firmware, but which looks to be the culprit) and it gets a little
>>> too verbose. Provided the facts have been verified, how about
>>> 
>>>   /*
>>>    * GrUB has been observed to supply a NULL DeviceHandle. We can't use
>>>    * that to gain access to the filesystem. However the system can still
>>>    * boot if it doesn’t require access to the filesystem.
>>>    */
>>> 
>>> (and it's up to you whether you include your further "e.g. ..." then, but
>>> if you do I think the parenthesized part belong before the final full
>>> stop, and the last "in" would want to be "into")? It's still dubious to me
>>> how they can get away with such a NULL handle (and why that happens only
>>> on Arm), but I guess it would go too far to try to understand the
>>> background.
>> 
>> This might not be a problem in Grub but actually in EDK2 or due to the fact that
>> EDK2 is starting Grub which is starting Xen. Grub is not modifying this explicitly
>> from what we found during our investigations.
> 
> Otoh Luca said that there's no problem without GrUB. So maybe "GrUB
> in combination with EDK2 ..."?

Yes as Bertrand suggested and following your wording is more complete to say:

    /*
     * Grub2 running on top of EDK2 has been observed to supply a NULL
     * DeviceHandle. We can't use that to gain access to the filesystem.
     * However the system can still boot if it doesn’t require access to the
     * filesystem.
     */

> 
> Thinking more about it, this may also be partly related to our
> limitation to loading files from file systems (and not e.g. networks
> or RAM). Yet even then I couldn't see how a NULL device handle could
> be used for anything.
> 
> Jan
Stefano Stabellini Nov. 3, 2021, 10:29 p.m. UTC | #17
On Wed, 3 Nov 2021, Luca Fancellu wrote:
> > On 2 Nov 2021, at 23:17, Stefano Stabellini <sstabellini@kernel.org> wrote:
> > 
> > On Tue, 2 Nov 2021, Luca Fancellu wrote:
> >> The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
> >> ("arm/efi: Use dom0less configuration when using EFI boot") is
> >> introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
> >> 
> >> The problem comes from the function get_parent_handle(...) that inside
> >> uses the HandleProtocol on loaded_image->DeviceHandle, but the last
> >> is NULL, making Xen stop the UEFI boot.
> >> 
> >> Before the commit above, the function was never called because the
> >> logic was skipping the call when there were multiboot modules in the
> >> DT because the filesystem was never used and the bootloader had
> >> put in place all the right modules in memory and the addresses
> >> in the DT.
> >> 
> >> To fix the problem we allow the get_parent_handle(...) function to
> >> return a NULL handle on error and we check the usage of the function
> >> to handle the new use case. The function in fact should not prevent
> >> the boot even if the filesystem can't be used, because the DT and
> >> the modules could be put in place by the bootloader before running
> >> Xen and if xen,uefi-binary property is not used, there is no need
> >> for the filesystem.
> >> 
> >> Another problem is found when the UEFI stub tries to check if Dom0
> >> image or DomUs are present.
> >> The logic doesn't work when the UEFI stub is not responsible to load
> >> any modules, so the efi_check_dt_boot(...) return value is modified
> >> to return the number of multiboot module found and not only the number
> >> of module loaded by the stub.
> >> 
> >> Fixes: a1743fc3a9 ("arm/efi: Use dom0less configuration when using EFI boot")
> >> Signed-off-by: Luca Fancellu <luca.fancellu@arm.com>
> >> ---
> >> Justification for integration in 4.16:
> >> Upside: allow booting xen from grub on arm64 when the stub doesn't load
> >>        any module.
> >> Downside: It's affecting the EFI boot path.
> >> Risk: It's not affecting x86 arch that works the same way as before.
> >>      If something is wrong it creates a problem on early boot and not at
> >>      runtime, so risk is low.
> >> 
> >> Tested in this configurations:
> >> - Bootloader loads modules and specify them as multiboot modules in DT:
> >>   * combination of Dom0, DomUs, Dom0 and DomUs
> >> - DT specifies multiboot modules in DT using xen,uefi-binary property:
> >>   * combination of Dom0, DomUs, Dom0 and DomUs
> >> - Bootloader loads a Dom0 module and appends it as multiboot module in DT,
> >>   other multiboot modules are listed for DomUs using xen,uefi-binary
> >> - No multiboot modules in DT and no kernel entry in cfg file:
> >>   * proper error thrown
> >> 
> >> ---
> >> xen/arch/arm/efi/efi-boot.h | 28 ++++++++++++++++++----------
> >> xen/common/efi/boot.c       | 15 ++++++++++++++-
> >> 2 files changed, 32 insertions(+), 11 deletions(-)
> >> 
> >> diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
> >> index 8b88dd26a5..e714b2b44c 100644
> >> --- a/xen/arch/arm/efi/efi-boot.h
> >> +++ b/xen/arch/arm/efi/efi-boot.h
> >> @@ -51,9 +51,11 @@ static int handle_module_node(EFI_FILE_HANDLE dir_handle,
> >>                               int module_node_offset,
> >>                               int reg_addr_cells,
> >>                               int reg_size_cells,
> >> -                              bool is_domu_module);
> >> +                              bool is_domu_module,
> >> +                              unsigned int *modules_found);
> >> static int handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
> >> -                                       int domain_node);
> >> +                                       int domain_node,
> >> +                                       unsigned int *modules_found);
> >> static int efi_check_dt_boot(EFI_FILE_HANDLE dir_handle);
> >> 
> >> #define DEVICE_TREE_GUID \
> >> @@ -707,7 +709,8 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
> >>                                      int module_node_offset,
> >>                                      int reg_addr_cells,
> >>                                      int reg_size_cells,
> >> -                                     bool is_domu_module)
> >> +                                     bool is_domu_module,
> >> +                                     unsigned int *modules_found)
> >> {
> >>     const void *uefi_name_prop;
> >>     char mod_string[24]; /* Placeholder for module@ + a 64-bit number + \0 */
> >> @@ -725,6 +728,9 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
> >>         /* Module is not a multiboot,module */
> >>         return 0;
> >> 
> >> +    /* Count the multiboot module as found */
> >> +    (*modules_found)++;
> >> +
> >>     /* Read xen,uefi-binary property to get the file name. */
> >>     uefi_name_prop = fdt_getprop(fdt, module_node_offset, "xen,uefi-binary",
> >>                                  &uefi_name_len);
> >> @@ -804,7 +810,8 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
> >>  * Returns 0 on success, negative number on error.
> >>  */
> >> static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
> >> -                                              int domain_node)
> >> +                                              int domain_node,
> >> +                                              unsigned int *modules_found)
> >> {
> >>     int module_node, addr_cells, size_cells, len;
> >>     const struct fdt_property *prop;
> >> @@ -834,7 +841,7 @@ static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
> >>           module_node = fdt_next_subnode(fdt, module_node) )
> >>     {
> >>         int ret = handle_module_node(dir_handle, module_node, addr_cells,
> >> -                                     size_cells, true);
> >> +                                     size_cells, true, modules_found);
> >>         if ( ret < 0 )
> >>             return ret;
> >>     }
> >> @@ -845,12 +852,12 @@ static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
> >> /*
> >>  * This function checks for xen domain nodes under the /chosen node for possible
> >>  * dom0 and domU guests to be loaded.
> >> - * Returns the number of modules loaded or a negative number for error.
> >> + * Returns the number of multiboot modules found or a negative number for error.
> >>  */
> >> static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
> >> {
> >>     int chosen, node, addr_len, size_len;
> >> -    unsigned int i = 0;
> >> +    unsigned int i = 0, modules_found = 0;
> >> 
> >>     /* Check for the chosen node in the current DTB */
> >>     chosen = setup_chosen_node(fdt, &addr_len, &size_len);
> >> @@ -868,11 +875,12 @@ static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
> >>         if ( !fdt_node_check_compatible(fdt, node, "xen,domain") )
> >>         {
> >>             /* Found a node with compatible xen,domain; handle this node. */
> >> -            if ( handle_dom0less_domain_node(dir_handle, node) < 0 )
> >> +            if ( handle_dom0less_domain_node(dir_handle, node,
> >> +                                             &modules_found) < 0 )
> >>                 return ERROR_DT_MODULE_DOMU;
> >>         }
> >>         else if ( handle_module_node(dir_handle, node, addr_len, size_len,
> >> -                                     false) < 0 )
> >> +                                     false, &modules_found) < 0 )
> >>                  return ERROR_DT_MODULE_DOM0;
> > 
> > I think there is no need to add modules_found to the parameters of
> > handle_dom0less_domain_node and handle_module_node. You could just
> > increment modules_found here for every iteration of the loop where
> > there is no error.  You would have to change a couple of returns in
> > handle_module_node, just to give you the idea:
> 
> Yes we could do that but when we handle a xen,domain node we will count
> only one module and that defeats the aim to count every multiboot,module.
> 
> If we want to continue with your proposal let me know and I will implement it.

Not necessarely. We could make the change that both
handle_dom0less_domain_node and handle_module_node return:

- < 0 on error
- 0 on nothing to do
- the number of modules on success (currently this is also 0)

In the dom0less case, if one of the modules returns error, we should
return error as we already do today.


> > diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
> > index e714b2b44c..7739789c41 100644
> > --- a/xen/arch/arm/efi/efi-boot.h
> > +++ b/xen/arch/arm/efi/efi-boot.h
> > @@ -726,7 +726,7 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
> > 
> >     if ( module_compat != 0 )
> >         /* Module is not a multiboot,module */
> > -        return 0;
> > +        return 1;
> > 
> >     /* Count the multiboot module as found */
> >     (*modules_found)++;
> > @@ -737,7 +737,7 @@ static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
> > 
> >     if ( !uefi_name_prop )
> >         /* Property not found */
> > -        return 0;
> > +        return 1;
> > 
> >     file_idx = get_module_file_index(uefi_name_prop, uefi_name_len);
> >     if ( file_idx < 0 )
> > 
> > 
> >>     }
> >> 
> >> @@ -883,7 +891,7 @@ static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
> >>         efi_bs->FreePool(modules[i].name);
> >>     }
> >> 
> >> -    return modules_idx;
> >> +    return modules_found;
> >> }
> >> 
> >> static void __init efi_arch_cpu(void)
> >> diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c
> >> index 392ff3ac9b..495e7a4096 100644
> >> --- a/xen/common/efi/boot.c
> >> +++ b/xen/common/efi/boot.c
> >> @@ -449,6 +449,13 @@ static EFI_FILE_HANDLE __init get_parent_handle(EFI_LOADED_IMAGE *loaded_image,
> >>     CHAR16 *pathend, *ptr;
> >>     EFI_STATUS ret;
> >> 
> >> +    /*
> >> +     * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
> >> +     * to have access to the filesystem.
> >> +     */
> >> +    if ( !loaded_image->DeviceHandle )
> >> +        return NULL;
> >> +
> >>     do {
> >>         EFI_FILE_IO_INTERFACE *fio;
> >> 
> >> @@ -581,6 +588,8 @@ static bool __init read_file(EFI_FILE_HANDLE dir_handle, CHAR16 *name,
> >>     EFI_STATUS ret;
> >>     const CHAR16 *what = NULL;
> >> 
> >> +    if ( !dir_handle )
> >> +        blexit(L"Error: No access to the filesystem");
> >>     if ( !name )
> >>         PrintErrMesg(L"No filename", EFI_OUT_OF_RESOURCES);
> >>     ret = dir_handle->Open(dir_handle, &FileHandle, name,
> >> @@ -1333,6 +1342,9 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
> >>             EFI_FILE_HANDLE handle = get_parent_handle(loaded_image,
> >>                                                        &file_name);
> >> 
> >> +            if ( !handle )
> >> +                blexit(L"Error retrieving image name: no filesystem access");
> > 
> > I think it would be nice to have an other explicit check like this one
> > at the beginning of if ( use_cfg_file ) to make sure dir_handle is not
> > null in that case. If I am not mistaken, if we take the use_cfg_file
> > path, dir_handle has to be valid, right?
> 
> Dir_handle could be invalid and we would be able to boot successfully when we take everywhere
> the path using read_section, for that reason I didn’t stop the boot earlier.
> Given Jan suggestion that check could be also modified to be something like “if there is no handle, *argv=“xen.efi” "
> so the boot can continue without problem if we don’t need to read anything from the filesystem.

OK
diff mbox series

Patch

diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
index 8b88dd26a5..e714b2b44c 100644
--- a/xen/arch/arm/efi/efi-boot.h
+++ b/xen/arch/arm/efi/efi-boot.h
@@ -51,9 +51,11 @@  static int handle_module_node(EFI_FILE_HANDLE dir_handle,
                               int module_node_offset,
                               int reg_addr_cells,
                               int reg_size_cells,
-                              bool is_domu_module);
+                              bool is_domu_module,
+                              unsigned int *modules_found);
 static int handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
-                                       int domain_node);
+                                       int domain_node,
+                                       unsigned int *modules_found);
 static int efi_check_dt_boot(EFI_FILE_HANDLE dir_handle);
 
 #define DEVICE_TREE_GUID \
@@ -707,7 +709,8 @@  static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
                                      int module_node_offset,
                                      int reg_addr_cells,
                                      int reg_size_cells,
-                                     bool is_domu_module)
+                                     bool is_domu_module,
+                                     unsigned int *modules_found)
 {
     const void *uefi_name_prop;
     char mod_string[24]; /* Placeholder for module@ + a 64-bit number + \0 */
@@ -725,6 +728,9 @@  static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
         /* Module is not a multiboot,module */
         return 0;
 
+    /* Count the multiboot module as found */
+    (*modules_found)++;
+
     /* Read xen,uefi-binary property to get the file name. */
     uefi_name_prop = fdt_getprop(fdt, module_node_offset, "xen,uefi-binary",
                                  &uefi_name_len);
@@ -804,7 +810,8 @@  static int __init handle_module_node(EFI_FILE_HANDLE dir_handle,
  * Returns 0 on success, negative number on error.
  */
 static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
-                                              int domain_node)
+                                              int domain_node,
+                                              unsigned int *modules_found)
 {
     int module_node, addr_cells, size_cells, len;
     const struct fdt_property *prop;
@@ -834,7 +841,7 @@  static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
           module_node = fdt_next_subnode(fdt, module_node) )
     {
         int ret = handle_module_node(dir_handle, module_node, addr_cells,
-                                     size_cells, true);
+                                     size_cells, true, modules_found);
         if ( ret < 0 )
             return ret;
     }
@@ -845,12 +852,12 @@  static int __init handle_dom0less_domain_node(EFI_FILE_HANDLE dir_handle,
 /*
  * This function checks for xen domain nodes under the /chosen node for possible
  * dom0 and domU guests to be loaded.
- * Returns the number of modules loaded or a negative number for error.
+ * Returns the number of multiboot modules found or a negative number for error.
  */
 static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
 {
     int chosen, node, addr_len, size_len;
-    unsigned int i = 0;
+    unsigned int i = 0, modules_found = 0;
 
     /* Check for the chosen node in the current DTB */
     chosen = setup_chosen_node(fdt, &addr_len, &size_len);
@@ -868,11 +875,12 @@  static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
         if ( !fdt_node_check_compatible(fdt, node, "xen,domain") )
         {
             /* Found a node with compatible xen,domain; handle this node. */
-            if ( handle_dom0less_domain_node(dir_handle, node) < 0 )
+            if ( handle_dom0less_domain_node(dir_handle, node,
+                                             &modules_found) < 0 )
                 return ERROR_DT_MODULE_DOMU;
         }
         else if ( handle_module_node(dir_handle, node, addr_len, size_len,
-                                     false) < 0 )
+                                     false, &modules_found) < 0 )
                  return ERROR_DT_MODULE_DOM0;
     }
 
@@ -883,7 +891,7 @@  static int __init efi_check_dt_boot(EFI_FILE_HANDLE dir_handle)
         efi_bs->FreePool(modules[i].name);
     }
 
-    return modules_idx;
+    return modules_found;
 }
 
 static void __init efi_arch_cpu(void)
diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c
index 392ff3ac9b..495e7a4096 100644
--- a/xen/common/efi/boot.c
+++ b/xen/common/efi/boot.c
@@ -449,6 +449,13 @@  static EFI_FILE_HANDLE __init get_parent_handle(EFI_LOADED_IMAGE *loaded_image,
     CHAR16 *pathend, *ptr;
     EFI_STATUS ret;
 
+    /*
+     * If DeviceHandle is NULL, we can't use the SIMPLE_FILE_SYSTEM_PROTOCOL
+     * to have access to the filesystem.
+     */
+    if ( !loaded_image->DeviceHandle )
+        return NULL;
+
     do {
         EFI_FILE_IO_INTERFACE *fio;
 
@@ -581,6 +588,8 @@  static bool __init read_file(EFI_FILE_HANDLE dir_handle, CHAR16 *name,
     EFI_STATUS ret;
     const CHAR16 *what = NULL;
 
+    if ( !dir_handle )
+        blexit(L"Error: No access to the filesystem");
     if ( !name )
         PrintErrMesg(L"No filename", EFI_OUT_OF_RESOURCES);
     ret = dir_handle->Open(dir_handle, &FileHandle, name,
@@ -1333,6 +1342,9 @@  efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
             EFI_FILE_HANDLE handle = get_parent_handle(loaded_image,
                                                        &file_name);
 
+            if ( !handle )
+                blexit(L"Error retrieving image name: no filesystem access");
+
             handle->Close(handle);
             *argv = file_name;
         }
@@ -1369,7 +1381,8 @@  efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
     /* Get the number of boot modules specified on the DT or an error (<0) */
     dt_modules_found = efi_check_dt_boot(dir_handle);
 
-    dir_handle->Close(dir_handle);
+    if ( dir_handle )
+        dir_handle->Close(dir_handle);
 
     if ( dt_modules_found < 0 )
         /* efi_check_dt_boot throws some error */