diff mbox series

riscv: head: use 0 as the default text_offset

Message ID 20221128152442.3403-1-jszhang@kernel.org (mailing list archive)
State Rejected
Headers show
Series riscv: head: use 0 as the default text_offset | expand

Checks

Context Check Description
conchuod/patch_count success Link
conchuod/cover_letter success Single patches do not need cover letters
conchuod/tree_selection success Guessed tree name to be for-next
conchuod/fixes_present success Fixes tag not required for -next series
conchuod/verify_signedoff success Signed-off-by tag matches author and committer
conchuod/kdoc success Errors and warnings before: 0 this patch: 0
conchuod/module_param success Was 0 now: 0
conchuod/build_rv32_defconfig success Build OK
conchuod/build_warn_rv64 success Errors and warnings before: 0 this patch: 0
conchuod/dtb_warn_rv64 success Errors and warnings before: 0 this patch: 0
conchuod/header_inline success No static functions without inline keyword in header files
conchuod/checkpatch success total: 0 errors, 0 warnings, 0 checks, 19 lines checked
conchuod/source_inline success Was 0 now: 0
conchuod/build_rv64_nommu_k210_defconfig success Build OK
conchuod/verify_fixes success No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig success Build OK

Commit Message

Jisheng Zhang Nov. 28, 2022, 3:24 p.m. UTC
Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
parse.") adds an image header which "is based on ARM64 boot image
header and provides an opportunity to combine both ARM64 & RISC-V
image headers in future.". At that time, arm64's default text_offset
is 0x80000, this is to give "512 KB of guaranteed BSS space to put
the swapper page tables" as commit cfa7ede20f13 ("arm64: set TEXT_OFFSET
to 0x0 in preparation for removing it entirely") pointed out, but
riscv doesn't need the space, so use 0 as the default text_offset.

Before this patch, booting linux kernel on Sipeed bl808 M1s Dock
with u-boot booti cmd:
[    0.000000] OF: fdt: Ignoring memory range 0x50000000 - 0x50200000
...
[    0.000000]   DMA32    [mem 0x0000000050200000-0x0000000053ffffff]
As can be seen, 2MB DDR(0x50000000 - 0x501fffff) can't be used by
linux.

After this patch, the 64MB DDR is fully usable by linux
[    0.000000]   DMA32    [mem 0x0000000050000000-0x0000000053ffffff]

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
---
 arch/riscv/kernel/head.S | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

Comments

Andreas Schwab Nov. 28, 2022, 4:03 p.m. UTC | #1
On Nov 28 2022, Jisheng Zhang wrote:

> Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
> parse.") adds an image header which "is based on ARM64 boot image
> header and provides an opportunity to combine both ARM64 & RISC-V
> image headers in future.". At that time, arm64's default text_offset
> is 0x80000, this is to give "512 KB of guaranteed BSS space to put
> the swapper page tables" as commit cfa7ede20f13 ("arm64: set TEXT_OFFSET
> to 0x0 in preparation for removing it entirely") pointed out, but
> riscv doesn't need the space, so use 0 as the default text_offset.

Doesn't that clash with the memory reserved for openSBI?
Atish Kumar Patra Nov. 28, 2022, 8:11 p.m. UTC | #2
On Mon, Nov 28, 2022 at 7:34 AM Jisheng Zhang <jszhang@kernel.org> wrote:
>
> Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
> parse.") adds an image header which "is based on ARM64 boot image
> header and provides an opportunity to combine both ARM64 & RISC-V
> image headers in future.". At that time, arm64's default text_offset
> is 0x80000, this is to give "512 KB of guaranteed BSS space to put
> the swapper page tables" as commit cfa7ede20f13 ("arm64: set TEXT_OFFSET
> to 0x0 in preparation for removing it entirely") pointed out, but
> riscv doesn't need the space, so use 0 as the default text_offset.
>
> Before this patch, booting linux kernel on Sipeed bl808 M1s Dock
> with u-boot booti cmd:
> [    0.000000] OF: fdt: Ignoring memory range 0x50000000 - 0x50200000
> ...
> [    0.000000]   DMA32    [mem 0x0000000050200000-0x0000000053ffffff]
> As can be seen, 2MB DDR(0x50000000 - 0x501fffff) can't be used by
> linux.
>
> After this patch, the 64MB DDR is fully usable by linux
> [    0.000000]   DMA32    [mem 0x0000000050000000-0x0000000053ffffff]
>
> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> ---
>  arch/riscv/kernel/head.S | 12 +-----------
>  1 file changed, 1 insertion(+), 11 deletions(-)
>
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index b865046e4dbb..ef95943f7a70 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -38,18 +38,8 @@ ENTRY(_start)
>         .word 0
>  #endif
>         .balign 8
> -#ifdef CONFIG_RISCV_M_MODE
> -       /* Image load offset (0MB) from start of RAM for M-mode */
> +       /* Image load offset (0MB) from start of RAM */
>         .dword 0
> -#else
> -#if __riscv_xlen == 64
> -       /* Image load offset(2MB) from start of RAM */
> -       .dword 0x200000
> -#else
> -       /* Image load offset(4MB) from start of RAM */
> -       .dword 0x400000
> -#endif
> -#endif

NACK.
RV64 needs to boot at a 2MB aligned address and RV32 needs to boot at
a 4MB aligned address.
The firmware is assumed to live at the start of DRAM for Linux running
in S-mode.


>         /* Effective size of kernel image */
>         .dword _end - _start
>         .dword __HEAD_FLAGS
> --
> 2.37.2
>
Samuel Holland Nov. 29, 2022, 5:04 a.m. UTC | #3
On 11/28/22 14:11, Atish Kumar Patra wrote:
> On Mon, Nov 28, 2022 at 7:34 AM Jisheng Zhang <jszhang@kernel.org> wrote:
>>
>> Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
>> parse.") adds an image header which "is based on ARM64 boot image
>> header and provides an opportunity to combine both ARM64 & RISC-V
>> image headers in future.". At that time, arm64's default text_offset
>> is 0x80000, this is to give "512 KB of guaranteed BSS space to put
>> the swapper page tables" as commit cfa7ede20f13 ("arm64: set TEXT_OFFSET
>> to 0x0 in preparation for removing it entirely") pointed out, but
>> riscv doesn't need the space, so use 0 as the default text_offset.
>>
>> Before this patch, booting linux kernel on Sipeed bl808 M1s Dock
>> with u-boot booti cmd:
>> [    0.000000] OF: fdt: Ignoring memory range 0x50000000 - 0x50200000
>> ...
>> [    0.000000]   DMA32    [mem 0x0000000050200000-0x0000000053ffffff]
>> As can be seen, 2MB DDR(0x50000000 - 0x501fffff) can't be used by
>> linux.
>>
>> After this patch, the 64MB DDR is fully usable by linux
>> [    0.000000]   DMA32    [mem 0x0000000050000000-0x0000000053ffffff]
>>
>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
>> ---
>>  arch/riscv/kernel/head.S | 12 +-----------
>>  1 file changed, 1 insertion(+), 11 deletions(-)
>>
>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>> index b865046e4dbb..ef95943f7a70 100644
>> --- a/arch/riscv/kernel/head.S
>> +++ b/arch/riscv/kernel/head.S
>> @@ -38,18 +38,8 @@ ENTRY(_start)
>>         .word 0
>>  #endif
>>         .balign 8
>> -#ifdef CONFIG_RISCV_M_MODE
>> -       /* Image load offset (0MB) from start of RAM for M-mode */
>> +       /* Image load offset (0MB) from start of RAM */
>>         .dword 0
>> -#else
>> -#if __riscv_xlen == 64
>> -       /* Image load offset(2MB) from start of RAM */
>> -       .dword 0x200000
>> -#else
>> -       /* Image load offset(4MB) from start of RAM */
>> -       .dword 0x400000
>> -#endif
>> -#endif
> 
> NACK.
> RV64 needs to boot at a 2MB aligned address and RV32 needs to boot at
> a 4MB aligned address.
> The firmware is assumed to live at the start of DRAM for Linux running
> in S-mode.

What needs to happen so we can stop making this assumption? If the SBI
implementation wants to reserve memory, it should use the devicetree to
do so. OpenSBI already does this.

Throwing away 2 MiB of RAM is quite wasteful considering we have
multiple SoCs (D1s, BL808) that are limited to 64 MiB of in-package RAM.

Regards,
Samuel
Alexandre Ghiti Nov. 29, 2022, 5:33 a.m. UTC | #4
On 11/29/22 06:04, Samuel Holland wrote:
> On 11/28/22 14:11, Atish Kumar Patra wrote:
>> On Mon, Nov 28, 2022 at 7:34 AM Jisheng Zhang <jszhang@kernel.org> wrote:
>>> Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
>>> parse.") adds an image header which "is based on ARM64 boot image
>>> header and provides an opportunity to combine both ARM64 & RISC-V
>>> image headers in future.". At that time, arm64's default text_offset
>>> is 0x80000, this is to give "512 KB of guaranteed BSS space to put
>>> the swapper page tables" as commit cfa7ede20f13 ("arm64: set TEXT_OFFSET
>>> to 0x0 in preparation for removing it entirely") pointed out, but
>>> riscv doesn't need the space, so use 0 as the default text_offset.
>>>
>>> Before this patch, booting linux kernel on Sipeed bl808 M1s Dock
>>> with u-boot booti cmd:
>>> [    0.000000] OF: fdt: Ignoring memory range 0x50000000 - 0x50200000
>>> ...
>>> [    0.000000]   DMA32    [mem 0x0000000050200000-0x0000000053ffffff]
>>> As can be seen, 2MB DDR(0x50000000 - 0x501fffff) can't be used by
>>> linux.
>>>
>>> After this patch, the 64MB DDR is fully usable by linux
>>> [    0.000000]   DMA32    [mem 0x0000000050000000-0x0000000053ffffff]
>>>
>>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
>>> ---
>>>   arch/riscv/kernel/head.S | 12 +-----------
>>>   1 file changed, 1 insertion(+), 11 deletions(-)
>>>
>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>> index b865046e4dbb..ef95943f7a70 100644
>>> --- a/arch/riscv/kernel/head.S
>>> +++ b/arch/riscv/kernel/head.S
>>> @@ -38,18 +38,8 @@ ENTRY(_start)
>>>          .word 0
>>>   #endif
>>>          .balign 8
>>> -#ifdef CONFIG_RISCV_M_MODE
>>> -       /* Image load offset (0MB) from start of RAM for M-mode */
>>> +       /* Image load offset (0MB) from start of RAM */
>>>          .dword 0
>>> -#else
>>> -#if __riscv_xlen == 64
>>> -       /* Image load offset(2MB) from start of RAM */
>>> -       .dword 0x200000
>>> -#else
>>> -       /* Image load offset(4MB) from start of RAM */
>>> -       .dword 0x400000
>>> -#endif
>>> -#endif
>> NACK.
>> RV64 needs to boot at a 2MB aligned address and RV32 needs to boot at
>> a 4MB aligned address.
>> The firmware is assumed to live at the start of DRAM for Linux running
>> in S-mode.
> What needs to happen so we can stop making this assumption? If the SBI
> implementation wants to reserve memory, it should use the devicetree to
> do so. OpenSBI already does this.
>
> Throwing away 2 MiB of RAM is quite wasteful considering we have
> multiple SoCs (D1s, BL808) that are limited to 64 MiB of in-package RAM.


Actually OpenSBI needs less than 2MB of memory (512KB IIRC), and its 
size is marked in the DT. But in practice, it is the memblock subsystem 
that removes any memory below the 2MB offset because __pa translation is 
"invalid" before setup_vm_final. So we can retrieve 1.5MB by simply 
redefining MIN_MEMBLOCK_ADDR, have a look at 
https://patchwork.kernel.org/project/linux-riscv/patch/20221122084141.1849421-1-alexghiti@rivosinc.com/


>
> Regards,
> Samuel
>
>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Palmer Dabbelt Nov. 29, 2022, 6:19 a.m. UTC | #5
On Mon, 28 Nov 2022 21:04:48 PST (-0800), samuel@sholland.org wrote:
> On 11/28/22 14:11, Atish Kumar Patra wrote:
>> On Mon, Nov 28, 2022 at 7:34 AM Jisheng Zhang <jszhang@kernel.org> wrote:
>>>
>>> Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
>>> parse.") adds an image header which "is based on ARM64 boot image
>>> header and provides an opportunity to combine both ARM64 & RISC-V
>>> image headers in future.". At that time, arm64's default text_offset
>>> is 0x80000, this is to give "512 KB of guaranteed BSS space to put
>>> the swapper page tables" as commit cfa7ede20f13 ("arm64: set TEXT_OFFSET
>>> to 0x0 in preparation for removing it entirely") pointed out, but
>>> riscv doesn't need the space, so use 0 as the default text_offset.
>>>
>>> Before this patch, booting linux kernel on Sipeed bl808 M1s Dock
>>> with u-boot booti cmd:
>>> [    0.000000] OF: fdt: Ignoring memory range 0x50000000 - 0x50200000
>>> ...
>>> [    0.000000]   DMA32    [mem 0x0000000050200000-0x0000000053ffffff]
>>> As can be seen, 2MB DDR(0x50000000 - 0x501fffff) can't be used by
>>> linux.
>>>
>>> After this patch, the 64MB DDR is fully usable by linux
>>> [    0.000000]   DMA32    [mem 0x0000000050000000-0x0000000053ffffff]
>>>
>>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
>>> ---
>>>  arch/riscv/kernel/head.S | 12 +-----------
>>>  1 file changed, 1 insertion(+), 11 deletions(-)
>>>
>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>> index b865046e4dbb..ef95943f7a70 100644
>>> --- a/arch/riscv/kernel/head.S
>>> +++ b/arch/riscv/kernel/head.S
>>> @@ -38,18 +38,8 @@ ENTRY(_start)
>>>         .word 0
>>>  #endif
>>>         .balign 8
>>> -#ifdef CONFIG_RISCV_M_MODE
>>> -       /* Image load offset (0MB) from start of RAM for M-mode */
>>> +       /* Image load offset (0MB) from start of RAM */
>>>         .dword 0
>>> -#else
>>> -#if __riscv_xlen == 64
>>> -       /* Image load offset(2MB) from start of RAM */
>>> -       .dword 0x200000
>>> -#else
>>> -       /* Image load offset(4MB) from start of RAM */
>>> -       .dword 0x400000
>>> -#endif
>>> -#endif
>>
>> NACK.
>> RV64 needs to boot at a 2MB aligned address and RV32 needs to boot at
>> a 4MB aligned address.
>> The firmware is assumed to live at the start of DRAM for Linux running
>> in S-mode.
>
> What needs to happen so we can stop making this assumption? If the SBI
> implementation wants to reserve memory, it should use the devicetree to
> do so. OpenSBI already does this.

IMO we've really screwed up the boot flow on RISC-V.  Having Linux 
reserve space for the firmware is just all backwards, Linux can't know 
how much memory the firmware needs (which manifests under large hart 
counts in OpenSBI, for example).  Unfortunately there's no specification 
that defines these platform-level details, so we're stuck depending on  
unspecified behavior like this.

I think we could fix this by either making Linux's early boot relocation 
code work sanely (fix whatever bugs are there, document what can't be 
fixed, and then add some sort of Image flag to tell firmware the kernel 
can be relocated) or relying on relocatable firmware, but both of those 
come with some costs ...

> Throwing away 2 MiB of RAM is quite wasteful considering we have
> multiple SoCs (D1s, BL808) that are limited to 64 MiB of in-package RAM.

... and I'd argue that users on systems don't want to pay those costs.  
In fact, I'd argue that systems like that don't want resident firmware 
at all.

So let's just add a CONFIG_SBI=n, and then just use direct drivers for 
everything.  If the firmware doesn't need to be resident then it's 
pretty straight-forward to support these 0 offsets, so we can just add 
that as another Kconfig.  Sure this will trip up firmware that depends 
on these fixed reservations, but saying "the resident firmware fits in 0 
superpages" is just as much of a platform-specific dependency as saying 
"the resident firmware fits in 1 superpage".  If firmware can't handle 
this field in the Image format then we're going to end up with breakages 
at some point, it might as well be now.

If these systems don't have all the ISA bits necessary to avoid M-mode 
entirely then we can just implement a tiny M-mode stub in Linux that 
gets left around during early boot and then shims stuff to S-mode.  
That'll be a bit of a headache and with some extensions it can be 
avoided, the standard stuff won't allow for that until the latest round 
of specs is done but if it's possible via whatever custom extensions are 
in these things then that's probably the way to go.

We'll probably also end up shaking out some bugs in that early 
relocation code, but again no big deal: we'll need to chase those down 
at some point, it might as well be now.
Samuel Holland Nov. 29, 2022, 6:55 a.m. UTC | #6
On 11/29/22 00:19, Palmer Dabbelt wrote:
> On Mon, 28 Nov 2022 21:04:48 PST (-0800), samuel@sholland.org wrote:
>> On 11/28/22 14:11, Atish Kumar Patra wrote:
>>> On Mon, Nov 28, 2022 at 7:34 AM Jisheng Zhang <jszhang@kernel.org>
>>> wrote:
>>>>
>>>> Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
>>>> parse.") adds an image header which "is based on ARM64 boot image
>>>> header and provides an opportunity to combine both ARM64 & RISC-V
>>>> image headers in future.". At that time, arm64's default text_offset
>>>> is 0x80000, this is to give "512 KB of guaranteed BSS space to put
>>>> the swapper page tables" as commit cfa7ede20f13 ("arm64: set
>>>> TEXT_OFFSET
>>>> to 0x0 in preparation for removing it entirely") pointed out, but
>>>> riscv doesn't need the space, so use 0 as the default text_offset.
>>>>
>>>> Before this patch, booting linux kernel on Sipeed bl808 M1s Dock
>>>> with u-boot booti cmd:
>>>> [    0.000000] OF: fdt: Ignoring memory range 0x50000000 - 0x50200000
>>>> ...
>>>> [    0.000000]   DMA32    [mem 0x0000000050200000-0x0000000053ffffff]
>>>> As can be seen, 2MB DDR(0x50000000 - 0x501fffff) can't be used by
>>>> linux.
>>>>
>>>> After this patch, the 64MB DDR is fully usable by linux
>>>> [    0.000000]   DMA32    [mem 0x0000000050000000-0x0000000053ffffff]
>>>>
>>>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
>>>> ---
>>>>  arch/riscv/kernel/head.S | 12 +-----------
>>>>  1 file changed, 1 insertion(+), 11 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>> index b865046e4dbb..ef95943f7a70 100644
>>>> --- a/arch/riscv/kernel/head.S
>>>> +++ b/arch/riscv/kernel/head.S
>>>> @@ -38,18 +38,8 @@ ENTRY(_start)
>>>>         .word 0
>>>>  #endif
>>>>         .balign 8
>>>> -#ifdef CONFIG_RISCV_M_MODE
>>>> -       /* Image load offset (0MB) from start of RAM for M-mode */
>>>> +       /* Image load offset (0MB) from start of RAM */
>>>>         .dword 0
>>>> -#else
>>>> -#if __riscv_xlen == 64
>>>> -       /* Image load offset(2MB) from start of RAM */
>>>> -       .dword 0x200000
>>>> -#else
>>>> -       /* Image load offset(4MB) from start of RAM */
>>>> -       .dword 0x400000
>>>> -#endif
>>>> -#endif
>>>
>>> NACK.
>>> RV64 needs to boot at a 2MB aligned address and RV32 needs to boot at
>>> a 4MB aligned address.
>>> The firmware is assumed to live at the start of DRAM for Linux running
>>> in S-mode.
>>
>> What needs to happen so we can stop making this assumption? If the SBI
>> implementation wants to reserve memory, it should use the devicetree to
>> do so. OpenSBI already does this.
> 
> IMO we've really screwed up the boot flow on RISC-V.  Having Linux
> reserve space for the firmware is just all backwards, Linux can't know
> how much memory the firmware needs (which manifests under large hart
> counts in OpenSBI, for example).  Unfortunately there's no specification
> that defines these platform-level details, so we're stuck depending on 
> unspecified behavior like this.
> 
> I think we could fix this by either making Linux's early boot relocation
> code work sanely (fix whatever bugs are there, document what can't be
> fixed, and then add some sort of Image flag to tell firmware the kernel
> can be relocated) or relying on relocatable firmware, but both of those
> come with some costs ...

It sounds like Alexandre's patch[1] lets us use memory below this
offset, so we don't have to relocate the kernel to the beginning of RAM.
In fact, we could even increase the offset if we are concerned about the
kernel link address conflicting with the SBI implementation.

[1]:
https://patchwork.kernel.org/project/linux-riscv/patch/20221122084141.1849421-1-alexghiti@rivosinc.com/

>> Throwing away 2 MiB of RAM is quite wasteful considering we have
>> multiple SoCs (D1s, BL808) that are limited to 64 MiB of in-package RAM.
> 
> ... and I'd argue that users on systems don't want to pay those costs. 

What does fixing the early relocation code cost? Just longer boot time?
If the bootloader takes care of avoiding reserved-memory regions, and
Linux can run from wherever it gets loaded, that would be ideal to me.

> In fact, I'd argue that systems like that don't want resident firmware
> at all.

I would much rather pay 256 KiB for resident firmware than reimplement
all of the power management and PMU logic in Linux. It's not as bad as
losing 2 MiB when I know most of that is unused.

> So let's just add a CONFIG_SBI=n, and then just use direct drivers for
> everything.  If the firmware doesn't need to be resident then it's
> pretty straight-forward to support these 0 offsets, so we can just add
> that as another Kconfig.  Sure this will trip up firmware that depends
> on these fixed reservations, but saying "the resident firmware fits in 0
> superpages" is just as much of a platform-specific dependency as saying
> "the resident firmware fits in 1 superpage".  If firmware can't handle
> this field in the Image format then we're going to end up with breakages
> at some point, it might as well be now.
>
> If these systems don't have all the ISA bits necessary to avoid M-mode
> entirely then we can just implement a tiny M-mode stub in Linux that
> gets left around during early boot and then shims stuff to S-mode. 
> That'll be a bit of a headache and with some extensions it can be
> avoided, the standard stuff won't allow for that until the latest round
> of specs is done but if it's possible via whatever custom extensions are
> in these things then that's probably the way to go.

I don't think Linux has a choice here, when started in S-mode. And
neither does the bootloader parsing the Image, because it most likely
runs in S-mode as well.

And when started in M-mode, we already don't use SBI.

Regards,
Samuel
Atish Kumar Patra Nov. 29, 2022, 9:15 a.m. UTC | #7
On Mon, Nov 28, 2022 at 10:55 PM Samuel Holland <samuel@sholland.org> wrote:
>
> On 11/29/22 00:19, Palmer Dabbelt wrote:
> > On Mon, 28 Nov 2022 21:04:48 PST (-0800), samuel@sholland.org wrote:
> >> On 11/28/22 14:11, Atish Kumar Patra wrote:
> >>> On Mon, Nov 28, 2022 at 7:34 AM Jisheng Zhang <jszhang@kernel.org>
> >>> wrote:
> >>>>
> >>>> Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
> >>>> parse.") adds an image header which "is based on ARM64 boot image
> >>>> header and provides an opportunity to combine both ARM64 & RISC-V
> >>>> image headers in future.". At that time, arm64's default text_offset
> >>>> is 0x80000, this is to give "512 KB of guaranteed BSS space to put
> >>>> the swapper page tables" as commit cfa7ede20f13 ("arm64: set
> >>>> TEXT_OFFSET
> >>>> to 0x0 in preparation for removing it entirely") pointed out, but
> >>>> riscv doesn't need the space, so use 0 as the default text_offset.
> >>>>
> >>>> Before this patch, booting linux kernel on Sipeed bl808 M1s Dock
> >>>> with u-boot booti cmd:
> >>>> [    0.000000] OF: fdt: Ignoring memory range 0x50000000 - 0x50200000
> >>>> ...
> >>>> [    0.000000]   DMA32    [mem 0x0000000050200000-0x0000000053ffffff]
> >>>> As can be seen, 2MB DDR(0x50000000 - 0x501fffff) can't be used by
> >>>> linux.
> >>>>
> >>>> After this patch, the 64MB DDR is fully usable by linux
> >>>> [    0.000000]   DMA32    [mem 0x0000000050000000-0x0000000053ffffff]
> >>>>
> >>>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> >>>> ---
> >>>>  arch/riscv/kernel/head.S | 12 +-----------
> >>>>  1 file changed, 1 insertion(+), 11 deletions(-)
> >>>>
> >>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> >>>> index b865046e4dbb..ef95943f7a70 100644
> >>>> --- a/arch/riscv/kernel/head.S
> >>>> +++ b/arch/riscv/kernel/head.S
> >>>> @@ -38,18 +38,8 @@ ENTRY(_start)
> >>>>         .word 0
> >>>>  #endif
> >>>>         .balign 8
> >>>> -#ifdef CONFIG_RISCV_M_MODE
> >>>> -       /* Image load offset (0MB) from start of RAM for M-mode */
> >>>> +       /* Image load offset (0MB) from start of RAM */
> >>>>         .dword 0
> >>>> -#else
> >>>> -#if __riscv_xlen == 64
> >>>> -       /* Image load offset(2MB) from start of RAM */
> >>>> -       .dword 0x200000
> >>>> -#else
> >>>> -       /* Image load offset(4MB) from start of RAM */
> >>>> -       .dword 0x400000
> >>>> -#endif
> >>>> -#endif
> >>>
> >>> NACK.
> >>> RV64 needs to boot at a 2MB aligned address and RV32 needs to boot at
> >>> a 4MB aligned address.
> >>> The firmware is assumed to live at the start of DRAM for Linux running
> >>> in S-mode.
> >>
> >> What needs to happen so we can stop making this assumption? If the SBI
> >> implementation wants to reserve memory, it should use the devicetree to
> >> do so. OpenSBI already does this.
> >
> > IMO we've really screwed up the boot flow on RISC-V.  Having Linux
> > reserve space for the firmware is just all backwards, Linux can't know
> > how much memory the firmware needs (which manifests under large hart
> > counts in OpenSBI, for example).  Unfortunately there's no specification
> > that defines these platform-level details, so we're stuck depending on
> > unspecified behavior like this.
> >
> > I think we could fix this by either making Linux's early boot relocation
> > code work sanely (fix whatever bugs are there, document what can't be
> > fixed, and then add some sort of Image flag to tell firmware the kernel
> > can be relocated) or relying on relocatable firmware, but both of those
> > come with some costs ...
>
> It sounds like Alexandre's patch[1] lets us use memory below this
> offset, so we don't have to relocate the kernel to the beginning of RAM.
> In fact, we could even increase the offset if we are concerned about the
> kernel link address conflicting with the SBI implementation.
>
> [1]:
> https://patchwork.kernel.org/project/linux-riscv/patch/20221122084141.1849421-1-alexghiti@rivosinc.com/
>
> >> Throwing away 2 MiB of RAM is quite wasteful considering we have
> >> multiple SoCs (D1s, BL808) that are limited to 64 MiB of in-package RAM.
> >
> > ... and I'd argue that users on systems don't want to pay those costs.
>
> What does fixing the early relocation code cost? Just longer boot time?
> If the bootloader takes care of avoiding reserved-memory regions, and
> Linux can run from wherever it gets loaded, that would be ideal to me.
>
> > In fact, I'd argue that systems like that don't want resident firmware
> > at all.
>
> I would much rather pay 256 KiB for resident firmware than reimplement
> all of the power management and PMU logic in Linux. It's not as bad as
> losing 2 MiB when I know most of that is unused.
>

There are also debug triggers, AP-TEE which are SBI extensions.
In addition to that we have steal time accounting (STA) SBI extension
in virtualization
use cases as well.

Note: The PMU requirement will probably no longer be if supervisor
counter delegation
extension is approved. But it will take some time for hardware to
actually implement that.

> > So let's just add a CONFIG_SBI=n, and then just use direct drivers for
> > everything.  If the firmware doesn't need to be resident then it's
> > pretty straight-forward to support these 0 offsets, so we can just add
> > that as another Kconfig.  Sure this will trip up firmware that depends
> > on these fixed reservations, but saying "the resident firmware fits in 0
> > superpages" is just as much of a platform-specific dependency as saying
> > "the resident firmware fits in 1 superpage".  If firmware can't handle
> > this field in the Image format then we're going to end up with breakages
> > at some point, it might as well be now.
> >
> > If these systems don't have all the ISA bits necessary to avoid M-mode
> > entirely then we can just implement a tiny M-mode stub in Linux that
> > gets left around during early boot and then shims stuff to S-mode.
> > That'll be a bit of a headache and with some extensions it can be
> > avoided, the standard stuff won't allow for that until the latest round
> > of specs is done but if it's possible via whatever custom extensions are
> > in these things then that's probably the way to go.
>
> I don't think Linux has a choice here, when started in S-mode. And
> neither does the bootloader parsing the Image, because it most likely
> runs in S-mode as well.
>
> And when started in M-mode, we already don't use SBI.
>
> Regards,
> Samuel
>
Anup Patel Nov. 29, 2022, 9:18 a.m. UTC | #8
On Tue, Nov 29, 2022 at 11:50 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Mon, 28 Nov 2022 21:04:48 PST (-0800), samuel@sholland.org wrote:
> > On 11/28/22 14:11, Atish Kumar Patra wrote:
> >> On Mon, Nov 28, 2022 at 7:34 AM Jisheng Zhang <jszhang@kernel.org> wrote:
> >>>
> >>> Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
> >>> parse.") adds an image header which "is based on ARM64 boot image
> >>> header and provides an opportunity to combine both ARM64 & RISC-V
> >>> image headers in future.". At that time, arm64's default text_offset
> >>> is 0x80000, this is to give "512 KB of guaranteed BSS space to put
> >>> the swapper page tables" as commit cfa7ede20f13 ("arm64: set TEXT_OFFSET
> >>> to 0x0 in preparation for removing it entirely") pointed out, but
> >>> riscv doesn't need the space, so use 0 as the default text_offset.
> >>>
> >>> Before this patch, booting linux kernel on Sipeed bl808 M1s Dock
> >>> with u-boot booti cmd:
> >>> [    0.000000] OF: fdt: Ignoring memory range 0x50000000 - 0x50200000
> >>> ...
> >>> [    0.000000]   DMA32    [mem 0x0000000050200000-0x0000000053ffffff]
> >>> As can be seen, 2MB DDR(0x50000000 - 0x501fffff) can't be used by
> >>> linux.
> >>>
> >>> After this patch, the 64MB DDR is fully usable by linux
> >>> [    0.000000]   DMA32    [mem 0x0000000050000000-0x0000000053ffffff]
> >>>
> >>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> >>> ---
> >>>  arch/riscv/kernel/head.S | 12 +-----------
> >>>  1 file changed, 1 insertion(+), 11 deletions(-)
> >>>
> >>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> >>> index b865046e4dbb..ef95943f7a70 100644
> >>> --- a/arch/riscv/kernel/head.S
> >>> +++ b/arch/riscv/kernel/head.S
> >>> @@ -38,18 +38,8 @@ ENTRY(_start)
> >>>         .word 0
> >>>  #endif
> >>>         .balign 8
> >>> -#ifdef CONFIG_RISCV_M_MODE
> >>> -       /* Image load offset (0MB) from start of RAM for M-mode */
> >>> +       /* Image load offset (0MB) from start of RAM */
> >>>         .dword 0
> >>> -#else
> >>> -#if __riscv_xlen == 64
> >>> -       /* Image load offset(2MB) from start of RAM */
> >>> -       .dword 0x200000
> >>> -#else
> >>> -       /* Image load offset(4MB) from start of RAM */
> >>> -       .dword 0x400000
> >>> -#endif
> >>> -#endif
> >>
> >> NACK.
> >> RV64 needs to boot at a 2MB aligned address and RV32 needs to boot at
> >> a 4MB aligned address.
> >> The firmware is assumed to live at the start of DRAM for Linux running
> >> in S-mode.
> >
> > What needs to happen so we can stop making this assumption? If the SBI
> > implementation wants to reserve memory, it should use the devicetree to
> > do so. OpenSBI already does this.
>
> IMO we've really screwed up the boot flow on RISC-V.  Having Linux
> reserve space for the firmware is just all backwards, Linux can't know
> how much memory the firmware needs (which manifests under large hart
> counts in OpenSBI, for example).  Unfortunately there's no specification
> that defines these platform-level details, so we're stuck depending on
> unspecified behavior like this.
>
> I think we could fix this by either making Linux's early boot relocation
> code work sanely (fix whatever bugs are there, document what can't be
> fixed, and then add some sort of Image flag to tell firmware the kernel
> can be relocated) or relying on relocatable firmware, but both of those
> come with some costs ...

Clearly, there is a misunderstanding about the "offset" field in the kernel
image header.

The "offset" field is not for reserving memory for firmware (OpenSBI). The
value of "offset" is a hint to bootloader that kernel expects to be booted
at a particular aligned physical address (2M for RV64 and 4M for RV32).
This "offset" hint is based on the fact that the kernel tries its best to use
huge mappings for kernel virtual addresses which results in good
performance in kernel space.

In fact, even today a bootloader can happily load the kernel at the
RAM_START and it will boot perfectly fine.

The real issue is that whenever __pa(PAGE_OFFSET) > RAM_START,
kernel will reserve memory from RAM_START to __pa(PAGE_OFFSET)
and this will happen irrespective where the firmware is running. The
patch posted by Alex tries to solve this issue whereas the current patch
tries to change the "offset" in the image header which is totally wrong.

Also, the Linux reserved-memory DT bindings define a standard way
to inform the kernel about reserved memory areas. The OpenSBI
firmware uses reserved-memory DT bindings to reserve its own
memory so kernel does not have to guess anything.

>
> > Throwing away 2 MiB of RAM is quite wasteful considering we have
> > multiple SoCs (D1s, BL808) that are limited to 64 MiB of in-package RAM.
>
> ... and I'd argue that users on systems don't want to pay those costs.
> In fact, I'd argue that systems like that don't want resident firmware
> at all.
>
> So let's just add a CONFIG_SBI=n, and then just use direct drivers for
> everything.  If the firmware doesn't need to be resident then it's
> pretty straight-forward to support these 0 offsets, so we can just add
> that as another Kconfig.  Sure this will trip up firmware that depends
> on these fixed reservations, but saying "the resident firmware fits in 0
> superpages" is just as much of a platform-specific dependency as saying
> "the resident firmware fits in 1 superpage".  If firmware can't handle
> this field in the Image format then we're going to end up with breakages
> at some point, it might as well be now.

The firmware (OpenSBI) does not depend on any fixed memory
reservations from the kernel. It is capable of running from any
memory location since it is compiled using -fPIC.

>
> If these systems don't have all the ISA bits necessary to avoid M-mode
> entirely then we can just implement a tiny M-mode stub in Linux that
> gets left around during early boot and then shims stuff to S-mode.
> That'll be a bit of a headache and with some extensions it can be
> avoided, the standard stuff won't allow for that until the latest round
> of specs is done but if it's possible via whatever custom extensions are
> in these things then that's probably the way to go.

Another misunderstanding here is that SBI stuff == firmware stuff. All
RISC-V hypervisors implement the SBI spec for Guest/VM. In absence
of SBI, how will Guest/VM bring-up secondary VCPUs or reboot itself ?

In fact without SBI, all paravirt approaches (such as steal time
accounting, time scaling, system reset, CPU hotplug etc) become
fragmented across hypervisors. Do we expect all hypervisors (proprietary
or opensource) send their own non-standard driver support to Linux
which is otherwise standardized through SBI spec ?

The M-mode can have dedicated resources (debug CSRs, PMU
counter configuration CSRs, etc) so in absence of SBI how will
S-mode use these features? We have more M-mode only resources
being defined by RVI (example Resumable NMIs) which will eventually
require SBI abstraction for S-mode.

Let alone CSRs, on future platforms, the M-mode will have attestation
capability so without SBI how will S-mode get this attestation facility.

>
> We'll probably also end up shaking out some bugs in that early
> relocation code, but again no big deal: we'll need to chase those down
> at some point, it might as well be now.

Regards,
Anup
Jisheng Zhang Nov. 29, 2022, 3:07 p.m. UTC | #9
On Tue, Nov 29, 2022 at 02:48:08PM +0530, Anup Patel wrote:
> On Tue, Nov 29, 2022 at 11:50 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
> >
> > On Mon, 28 Nov 2022 21:04:48 PST (-0800), samuel@sholland.org wrote:
> > > On 11/28/22 14:11, Atish Kumar Patra wrote:
> > >> On Mon, Nov 28, 2022 at 7:34 AM Jisheng Zhang <jszhang@kernel.org> wrote:
> > >>>
> > >>> Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
> > >>> parse.") adds an image header which "is based on ARM64 boot image
> > >>> header and provides an opportunity to combine both ARM64 & RISC-V
> > >>> image headers in future.". At that time, arm64's default text_offset
> > >>> is 0x80000, this is to give "512 KB of guaranteed BSS space to put
> > >>> the swapper page tables" as commit cfa7ede20f13 ("arm64: set TEXT_OFFSET
> > >>> to 0x0 in preparation for removing it entirely") pointed out, but
> > >>> riscv doesn't need the space, so use 0 as the default text_offset.
> > >>>
> > >>> Before this patch, booting linux kernel on Sipeed bl808 M1s Dock
> > >>> with u-boot booti cmd:
> > >>> [    0.000000] OF: fdt: Ignoring memory range 0x50000000 - 0x50200000
> > >>> ...
> > >>> [    0.000000]   DMA32    [mem 0x0000000050200000-0x0000000053ffffff]
> > >>> As can be seen, 2MB DDR(0x50000000 - 0x501fffff) can't be used by
> > >>> linux.
> > >>>
> > >>> After this patch, the 64MB DDR is fully usable by linux
> > >>> [    0.000000]   DMA32    [mem 0x0000000050000000-0x0000000053ffffff]
> > >>>
> > >>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> > >>> ---
> > >>>  arch/riscv/kernel/head.S | 12 +-----------
> > >>>  1 file changed, 1 insertion(+), 11 deletions(-)
> > >>>
> > >>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> > >>> index b865046e4dbb..ef95943f7a70 100644
> > >>> --- a/arch/riscv/kernel/head.S
> > >>> +++ b/arch/riscv/kernel/head.S
> > >>> @@ -38,18 +38,8 @@ ENTRY(_start)
> > >>>         .word 0
> > >>>  #endif
> > >>>         .balign 8
> > >>> -#ifdef CONFIG_RISCV_M_MODE
> > >>> -       /* Image load offset (0MB) from start of RAM for M-mode */
> > >>> +       /* Image load offset (0MB) from start of RAM */
> > >>>         .dword 0
> > >>> -#else
> > >>> -#if __riscv_xlen == 64
> > >>> -       /* Image load offset(2MB) from start of RAM */
> > >>> -       .dword 0x200000
> > >>> -#else
> > >>> -       /* Image load offset(4MB) from start of RAM */
> > >>> -       .dword 0x400000
> > >>> -#endif
> > >>> -#endif
> > >>
> > >> NACK.
> > >> RV64 needs to boot at a 2MB aligned address and RV32 needs to boot at
> > >> a 4MB aligned address.
> > >> The firmware is assumed to live at the start of DRAM for Linux running
> > >> in S-mode.
> > >
> > > What needs to happen so we can stop making this assumption? If the SBI
> > > implementation wants to reserve memory, it should use the devicetree to
> > > do so. OpenSBI already does this.
> >
> > IMO we've really screwed up the boot flow on RISC-V.  Having Linux
> > reserve space for the firmware is just all backwards, Linux can't know
> > how much memory the firmware needs (which manifests under large hart
> > counts in OpenSBI, for example).  Unfortunately there's no specification
> > that defines these platform-level details, so we're stuck depending on
> > unspecified behavior like this.
> >
> > I think we could fix this by either making Linux's early boot relocation
> > code work sanely (fix whatever bugs are there, document what can't be
> > fixed, and then add some sort of Image flag to tell firmware the kernel
> > can be relocated) or relying on relocatable firmware, but both of those
> > come with some costs ...
> 
> Clearly, there is a misunderstanding about the "offset" field in the kernel
> image header.
> 
> The "offset" field is not for reserving memory for firmware (OpenSBI). The

This is also my understanding -- offset != reserve memory for firmware.

> value of "offset" is a hint to bootloader that kernel expects to be booted
> at a particular aligned physical address (2M for RV64 and 4M for RV32).
> This "offset" hint is based on the fact that the kernel tries its best to use
> huge mappings for kernel virtual addresses which results in good
> performance in kernel space.

IIUC, the "offset" is used to tell bootloader to load the kernel to
linux_start_ram + offset, whether align to 2MB or not isn't the target.
u-boot's booti_setup() clearly indicates this usage.
So if linux_start_ram isn't 2MB aligned, linux_start_ram + offset isn't
2MB aligned either. IMHO, if we do care about performance, it's the
bootloader's job to chose a suitable linux_start_ram.

And the reason arm64 used text_offset is to ensure space to put the
swapper page tables, after v4.6, text_offset in arm64 is 0 as well.

> 
> In fact, even today a bootloader can happily load the kernel at the
> RAM_START and it will boot perfectly fine.
> 
> The real issue is that whenever __pa(PAGE_OFFSET) > RAM_START,
> kernel will reserve memory from RAM_START to __pa(PAGE_OFFSET)
> and this will happen irrespective where the firmware is running. The
> patch posted by Alex tries to solve this issue whereas the current patch
> tries to change the "offset" in the image header which is totally wrong.
> 
> Also, the Linux reserved-memory DT bindings define a standard way
> to inform the kernel about reserved memory areas. The OpenSBI
> firmware uses reserved-memory DT bindings to reserve its own
> memory so kernel does not have to guess anything.
> 
> >
> > > Throwing away 2 MiB of RAM is quite wasteful considering we have
> > > multiple SoCs (D1s, BL808) that are limited to 64 MiB of in-package RAM.
> >
> > ... and I'd argue that users on systems don't want to pay those costs.
> > In fact, I'd argue that systems like that don't want resident firmware
> > at all.
> >
> > So let's just add a CONFIG_SBI=n, and then just use direct drivers for
> > everything.  If the firmware doesn't need to be resident then it's
> > pretty straight-forward to support these 0 offsets, so we can just add
> > that as another Kconfig.  Sure this will trip up firmware that depends
> > on these fixed reservations, but saying "the resident firmware fits in 0
> > superpages" is just as much of a platform-specific dependency as saying
> > "the resident firmware fits in 1 superpage".  If firmware can't handle
> > this field in the Image format then we're going to end up with breakages
> > at some point, it might as well be now.
> 
> The firmware (OpenSBI) does not depend on any fixed memory
> reservations from the kernel. It is capable of running from any
> memory location since it is compiled using -fPIC.
> 
> >
> > If these systems don't have all the ISA bits necessary to avoid M-mode
> > entirely then we can just implement a tiny M-mode stub in Linux that
> > gets left around during early boot and then shims stuff to S-mode.
> > That'll be a bit of a headache and with some extensions it can be
> > avoided, the standard stuff won't allow for that until the latest round
> > of specs is done but if it's possible via whatever custom extensions are
> > in these things then that's probably the way to go.
> 
> Another misunderstanding here is that SBI stuff == firmware stuff. All
> RISC-V hypervisors implement the SBI spec for Guest/VM. In absence
> of SBI, how will Guest/VM bring-up secondary VCPUs or reboot itself ?
> 
> In fact without SBI, all paravirt approaches (such as steal time
> accounting, time scaling, system reset, CPU hotplug etc) become
> fragmented across hypervisors. Do we expect all hypervisors (proprietary
> or opensource) send their own non-standard driver support to Linux
> which is otherwise standardized through SBI spec ?
> 
> The M-mode can have dedicated resources (debug CSRs, PMU
> counter configuration CSRs, etc) so in absence of SBI how will
> S-mode use these features? We have more M-mode only resources
> being defined by RVI (example Resumable NMIs) which will eventually
> require SBI abstraction for S-mode.
> 
> Let alone CSRs, on future platforms, the M-mode will have attestation
> capability so without SBI how will S-mode get this attestation facility.
> 
> >
> > We'll probably also end up shaking out some bugs in that early
> > relocation code, but again no big deal: we'll need to chase those down
> > at some point, it might as well be now.
> 
> Regards,
> Anup
Palmer Dabbelt Nov. 29, 2022, 6:33 p.m. UTC | #10
On Mon, 28 Nov 2022 22:55:24 PST (-0800), samuel@sholland.org wrote:
> On 11/29/22 00:19, Palmer Dabbelt wrote:
>> On Mon, 28 Nov 2022 21:04:48 PST (-0800), samuel@sholland.org wrote:
>>> On 11/28/22 14:11, Atish Kumar Patra wrote:
>>>> On Mon, Nov 28, 2022 at 7:34 AM Jisheng Zhang <jszhang@kernel.org>
>>>> wrote:
>>>>>
>>>>> Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
>>>>> parse.") adds an image header which "is based on ARM64 boot image
>>>>> header and provides an opportunity to combine both ARM64 & RISC-V
>>>>> image headers in future.". At that time, arm64's default text_offset
>>>>> is 0x80000, this is to give "512 KB of guaranteed BSS space to put
>>>>> the swapper page tables" as commit cfa7ede20f13 ("arm64: set
>>>>> TEXT_OFFSET
>>>>> to 0x0 in preparation for removing it entirely") pointed out, but
>>>>> riscv doesn't need the space, so use 0 as the default text_offset.
>>>>>
>>>>> Before this patch, booting linux kernel on Sipeed bl808 M1s Dock
>>>>> with u-boot booti cmd:
>>>>> [    0.000000] OF: fdt: Ignoring memory range 0x50000000 - 0x50200000
>>>>> ...
>>>>> [    0.000000]   DMA32    [mem 0x0000000050200000-0x0000000053ffffff]
>>>>> As can be seen, 2MB DDR(0x50000000 - 0x501fffff) can't be used by
>>>>> linux.
>>>>>
>>>>> After this patch, the 64MB DDR is fully usable by linux
>>>>> [    0.000000]   DMA32    [mem 0x0000000050000000-0x0000000053ffffff]
>>>>>
>>>>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
>>>>> ---
>>>>>  arch/riscv/kernel/head.S | 12 +-----------
>>>>>  1 file changed, 1 insertion(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>>> index b865046e4dbb..ef95943f7a70 100644
>>>>> --- a/arch/riscv/kernel/head.S
>>>>> +++ b/arch/riscv/kernel/head.S
>>>>> @@ -38,18 +38,8 @@ ENTRY(_start)
>>>>>         .word 0
>>>>>  #endif
>>>>>         .balign 8
>>>>> -#ifdef CONFIG_RISCV_M_MODE
>>>>> -       /* Image load offset (0MB) from start of RAM for M-mode */
>>>>> +       /* Image load offset (0MB) from start of RAM */
>>>>>         .dword 0
>>>>> -#else
>>>>> -#if __riscv_xlen == 64
>>>>> -       /* Image load offset(2MB) from start of RAM */
>>>>> -       .dword 0x200000
>>>>> -#else
>>>>> -       /* Image load offset(4MB) from start of RAM */
>>>>> -       .dword 0x400000
>>>>> -#endif
>>>>> -#endif
>>>>
>>>> NACK.
>>>> RV64 needs to boot at a 2MB aligned address and RV32 needs to boot at
>>>> a 4MB aligned address.
>>>> The firmware is assumed to live at the start of DRAM for Linux running
>>>> in S-mode.
>>>
>>> What needs to happen so we can stop making this assumption? If the SBI
>>> implementation wants to reserve memory, it should use the devicetree to
>>> do so. OpenSBI already does this.
>>
>> IMO we've really screwed up the boot flow on RISC-V.  Having Linux
>> reserve space for the firmware is just all backwards, Linux can't know
>> how much memory the firmware needs (which manifests under large hart
>> counts in OpenSBI, for example).  Unfortunately there's no specification
>> that defines these platform-level details, so we're stuck depending on 
>> unspecified behavior like this.
>>
>> I think we could fix this by either making Linux's early boot relocation
>> code work sanely (fix whatever bugs are there, document what can't be
>> fixed, and then add some sort of Image flag to tell firmware the kernel
>> can be relocated) or relying on relocatable firmware, but both of those
>> come with some costs ...
>
> It sounds like Alexandre's patch[1] lets us use memory below this
> offset, so we don't have to relocate the kernel to the beginning of RAM.
> In fact, we could even increase the offset if we are concerned about the
> kernel link address conflicting with the SBI implementation.
>
> [1]:
> https://patchwork.kernel.org/project/linux-riscv/patch/20221122084141.1849421-1-alexghiti@rivosinc.com/
>
>>> Throwing away 2 MiB of RAM is quite wasteful considering we have
>>> multiple SoCs (D1s, BL808) that are limited to 64 MiB of in-package RAM.
>>
>> ... and I'd argue that users on systems don't want to pay those costs. 
>
> What does fixing the early relocation code cost? Just longer boot time?
> If the bootloader takes care of avoiding reserved-memory regions, and
> Linux can run from wherever it gets loaded, that would be ideal to me.

There's a runtime cost to either the relocatable kernel or firmware, 
but if those VA tricks are enough to make it work then it's pretty much 
free in terms of performance -- I think we'd even be able to avoid 
copying the kernel image around if the bootloader aligns it properly?

>> In fact, I'd argue that systems like that don't want resident firmware
>> at all.
>
> I would much rather pay 256 KiB for resident firmware than reimplement
> all of the power management and PMU logic in Linux. It's not as bad as
> losing 2 MiB when I know most of that is unused.

OK, sounds like just having the firmware accurately report the memory it 
needs to Linux and then reclaiming whatever's left in that 2MiB region 
is sufficient here?  That's way easier.

>> So let's just add a CONFIG_SBI=n, and then just use direct drivers for
>> everything.  If the firmware doesn't need to be resident then it's
>> pretty straight-forward to support these 0 offsets, so we can just add
>> that as another Kconfig.  Sure this will trip up firmware that depends
>> on these fixed reservations, but saying "the resident firmware fits in 0
>> superpages" is just as much of a platform-specific dependency as saying
>> "the resident firmware fits in 1 superpage".  If firmware can't handle
>> this field in the Image format then we're going to end up with breakages
>> at some point, it might as well be now.
>>
>> If these systems don't have all the ISA bits necessary to avoid M-mode
>> entirely then we can just implement a tiny M-mode stub in Linux that
>> gets left around during early boot and then shims stuff to S-mode. 
>> That'll be a bit of a headache and with some extensions it can be
>> avoided, the standard stuff won't allow for that until the latest round
>> of specs is done but if it's possible via whatever custom extensions are
>> in these things then that's probably the way to go.
>
> I don't think Linux has a choice here, when started in S-mode. And
> neither does the bootloader parsing the Image, because it most likely
> runs in S-mode as well.

I think there's some options here, but I guess we can leave that up to 
someone who wants the feature ;)

> And when started in M-mode, we already don't use SBI.
>
> Regards,
> Samuel
diff mbox series

Patch

diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index b865046e4dbb..ef95943f7a70 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -38,18 +38,8 @@  ENTRY(_start)
 	.word 0
 #endif
 	.balign 8
-#ifdef CONFIG_RISCV_M_MODE
-	/* Image load offset (0MB) from start of RAM for M-mode */
+	/* Image load offset (0MB) from start of RAM */
 	.dword 0
-#else
-#if __riscv_xlen == 64
-	/* Image load offset(2MB) from start of RAM */
-	.dword 0x200000
-#else
-	/* Image load offset(4MB) from start of RAM */
-	.dword 0x400000
-#endif
-#endif
 	/* Effective size of kernel image */
 	.dword _end - _start
 	.dword __HEAD_FLAGS