diff mbox series

x86/hvm/dom0: fix PVH initrd and metadata placement

Message ID 20231026064543.43369-1-xenia.ragiadakou@amd.com (mailing list archive)
State Superseded
Headers show
Series x86/hvm/dom0: fix PVH initrd and metadata placement | expand

Commit Message

Xenia Ragiadakou Oct. 26, 2023, 6:45 a.m. UTC
Given that start < kernel_end and end > kernel_start, the logic that
determines the best placement for dom0 initrd and metadata, does not
take into account the two cases below:
(1) start > kernel_start && end > kernel_end
(2) start < kernel_start && end < kernel_end

In case (1), the evaluation will result in end = kernel_start
i.e. end < start, and will load initrd in the middle of the kernel.
In case (2), the evaluation will result in start = kernel_end
i.e. end < start, and will load initrd at kernel_end, that is out
of the memory region under evaluation.

This patch rephrases the if condition to include the above two cases
without affecting the behaviour for the case where
start < kernel_start && end > kernel_end

Fixes: 73b47eea2104 ('x86/dom0: improve PVH initrd and metadata placement')
Signed-off-by: Xenia Ragiadakou <xenia.ragiadakou@amd.com>
---
 xen/arch/x86/hvm/dom0_build.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jan Beulich Oct. 26, 2023, 7:35 a.m. UTC | #1
On 26.10.2023 08:45, Xenia Ragiadakou wrote:
> Given that start < kernel_end and end > kernel_start, the logic that
> determines the best placement for dom0 initrd and metadata, does not
> take into account the two cases below:
> (1) start > kernel_start && end > kernel_end
> (2) start < kernel_start && end < kernel_end
> 
> In case (1), the evaluation will result in end = kernel_start
> i.e. end < start, and will load initrd in the middle of the kernel.
> In case (2), the evaluation will result in start = kernel_end
> i.e. end < start, and will load initrd at kernel_end, that is out
> of the memory region under evaluation.

I agree there is a problem if the kernel range overlaps but is not fully
contained in the E820 range under inspection. I'd like to ask though
under what conditions that can happen, as it seems suspicious for the
kernel range to span multiple E820 ranges.

> This patch rephrases the if condition to include the above two cases
> without affecting the behaviour for the case where
> start < kernel_start && end > kernel_end
> 
> Fixes: 73b47eea2104 ('x86/dom0: improve PVH initrd and metadata placement')
> Signed-off-by: Xenia Ragiadakou <xenia.ragiadakou@amd.com>
> ---
>  xen/arch/x86/hvm/dom0_build.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> index c7d47d0d4c..5fc2c12f3a 100644
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>          if ( end <= kernel_start || start >= kernel_end )
>              ; /* No overlap, nothing to do. */
>          /* Deal with the kernel already being loaded in the region. */
> -        else if ( kernel_start - start > end - kernel_end )
> +        else if ( kernel_start + kernel_end > start + end )

What meaning has the sum of the start and end of either range? I can't
figure how comparing those two values will be generally correct / useful.
If the partial-overlap case needs handling in the first place, I think
new conditionals need adding (and the existing one needs constraining to
"kernel range fully contained") to use
- as before, the larger of the non-overlapping ranges at start and end
  if the kernel range is fully contained,
- the tail of the range when the overlap is at the start,
- the head of the range when the overlap is at the end.

That said, in the "kernel range fully contained" case it may want
considering to use the tail range if it is large enough, rather than
the larger of the two ranges. In fact when switching to that model, we
ought to be able to get away with one less conditional, as then the
"kernel range fully contained" case doesn't need treating specially.

Jan
Xenia Ragiadakou Oct. 26, 2023, 8:34 a.m. UTC | #2
On 26/10/23 10:35, Jan Beulich wrote:
> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>> Given that start < kernel_end and end > kernel_start, the logic that
>> determines the best placement for dom0 initrd and metadata, does not
>> take into account the two cases below:
>> (1) start > kernel_start && end > kernel_end
>> (2) start < kernel_start && end < kernel_end
>>
>> In case (1), the evaluation will result in end = kernel_start
>> i.e. end < start, and will load initrd in the middle of the kernel.
>> In case (2), the evaluation will result in start = kernel_end
>> i.e. end < start, and will load initrd at kernel_end, that is out
>> of the memory region under evaluation.
> I agree there is a problem if the kernel range overlaps but is not fully
> contained in the E820 range under inspection. I'd like to ask though
> under what conditions that can happen, as it seems suspicious for the
> kernel range to span multiple E820 ranges.

We tried to boot Zephyr as pvh dom0 and its load address was under 1MB.

I know ... that maybe shouldn't have been permitted at all, but 
nevertheless we hit this issue.

>> This patch rephrases the if condition to include the above two cases
>> without affecting the behaviour for the case where
>> start < kernel_start && end > kernel_end
>>
>> Fixes: 73b47eea2104 ('x86/dom0: improve PVH initrd and metadata placement')
>> Signed-off-by: Xenia Ragiadakou<xenia.ragiadakou@amd.com>
>> ---
>>   xen/arch/x86/hvm/dom0_build.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
>> index c7d47d0d4c..5fc2c12f3a 100644
>> --- a/xen/arch/x86/hvm/dom0_build.c
>> +++ b/xen/arch/x86/hvm/dom0_build.c
>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>           if ( end <= kernel_start || start >= kernel_end )
>>               ; /* No overlap, nothing to do. */
>>           /* Deal with the kernel already being loaded in the region. */
>> -        else if ( kernel_start - start > end - kernel_end )
>> +        else if ( kernel_start + kernel_end > start + end )
> What meaning has the sum of the start and end of either range? I can't
> figure how comparing those two values will be generally correct / useful.
> If the partial-overlap case needs handling in the first place, I think
> new conditionals need adding (and the existing one needs constraining to
> "kernel range fully contained") to use
> - as before, the larger of the non-overlapping ranges at start and end
>    if the kernel range is fully contained,
> - the tail of the range when the overlap is at the start,
> - the head of the range when the overlap is at the end.

Yes it is not quite straight forward to understand and is based on the
assumption that end > kernel_start and start < kernel_end, due to
the first condition failing.

Both cases:
(start < kernel_start && end < kernel_end) and
(kernel_start - start > end - kernel_end)
fall into the condition ( kernel_start + kernel_end > start + end )

And both the cases:
(start > kernel_start && end > kernel_end) and
(end - kernel_end > kernel_start - start)
fall into the condition ( kernel_start + kernel_end < start + end )

... unless of course I miss a case

> That said, in the "kernel range fully contained" case it may want
> considering to use the tail range if it is large enough, rather than
> the larger of the two ranges. In fact when switching to that model, we
> ought to be able to get away with one less conditional, as then the
> "kernel range fully contained" case doesn't need treating specially.
>
> Jan
Jan Beulich Oct. 26, 2023, 8:45 a.m. UTC | #3
On 26.10.2023 10:34, Xenia Ragiadakou wrote:
> On 26/10/23 10:35, Jan Beulich wrote:
>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>> Given that start < kernel_end and end > kernel_start, the logic that
>>> determines the best placement for dom0 initrd and metadata, does not
>>> take into account the two cases below:
>>> (1) start > kernel_start && end > kernel_end
>>> (2) start < kernel_start && end < kernel_end
>>>
>>> In case (1), the evaluation will result in end = kernel_start
>>> i.e. end < start, and will load initrd in the middle of the kernel.
>>> In case (2), the evaluation will result in start = kernel_end
>>> i.e. end < start, and will load initrd at kernel_end, that is out
>>> of the memory region under evaluation.
>> I agree there is a problem if the kernel range overlaps but is not fully
>> contained in the E820 range under inspection. I'd like to ask though
>> under what conditions that can happen, as it seems suspicious for the
>> kernel range to span multiple E820 ranges.
> 
> We tried to boot Zephyr as pvh dom0 and its load address was under 1MB.
> 
> I know ... that maybe shouldn't have been permitted at all, but 
> nevertheless we hit this issue.

How can they expect to have a contiguous, large enough range there?

>>> This patch rephrases the if condition to include the above two cases
>>> without affecting the behaviour for the case where
>>> start < kernel_start && end > kernel_end
>>>
>>> Fixes: 73b47eea2104 ('x86/dom0: improve PVH initrd and metadata placement')
>>> Signed-off-by: Xenia Ragiadakou<xenia.ragiadakou@amd.com>
>>> ---
>>>   xen/arch/x86/hvm/dom0_build.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
>>> index c7d47d0d4c..5fc2c12f3a 100644
>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>           if ( end <= kernel_start || start >= kernel_end )
>>>               ; /* No overlap, nothing to do. */
>>>           /* Deal with the kernel already being loaded in the region. */
>>> -        else if ( kernel_start - start > end - kernel_end )
>>> +        else if ( kernel_start + kernel_end > start + end )
>> What meaning has the sum of the start and end of either range? I can't
>> figure how comparing those two values will be generally correct / useful.
>> If the partial-overlap case needs handling in the first place, I think
>> new conditionals need adding (and the existing one needs constraining to
>> "kernel range fully contained") to use
>> - as before, the larger of the non-overlapping ranges at start and end
>>    if the kernel range is fully contained,
>> - the tail of the range when the overlap is at the start,
>> - the head of the range when the overlap is at the end.
> 
> Yes it is not quite straight forward to understand and is based on the
> assumption that end > kernel_start and start < kernel_end, due to
> the first condition failing.
> 
> Both cases:
> (start < kernel_start && end < kernel_end) and
> (kernel_start - start > end - kernel_end)
> fall into the condition ( kernel_start + kernel_end > start + end )
> 
> And both the cases:
> (start > kernel_start && end > kernel_end) and
> (end - kernel_end > kernel_start - start)
> fall into the condition ( kernel_start + kernel_end < start + end )
> 
> ... unless of course I miss a case

Well, mathematically (i.e. ignoring the potential for overflow) the
original expression and your replacement are identical anyway. But
overflow needs to be taken into consideration, and hence there is a
(theoretical only at this point) risk with the replacement expression
as well. As a result I still think that ...

>> That said, in the "kernel range fully contained" case it may want
>> considering to use the tail range if it is large enough, rather than
>> the larger of the two ranges. In fact when switching to that model, we
>> ought to be able to get away with one less conditional, as then the
>> "kernel range fully contained" case doesn't need treating specially.

... this alternative approach may want considering (provided we need
to make a change in the first place, which I continue to be
unconvinced of).

Jan
Xenia Ragiadakou Oct. 26, 2023, 9:46 a.m. UTC | #4
On 26/10/23 11:45, Jan Beulich wrote:

> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
>> On 26/10/23 10:35, Jan Beulich wrote:
>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>> Given that start < kernel_end and end > kernel_start, the logic that
>>>> determines the best placement for dom0 initrd and metadata, does not
>>>> take into account the two cases below:
>>>> (1) start > kernel_start && end > kernel_end
>>>> (2) start < kernel_start && end < kernel_end
>>>>
>>>> In case (1), the evaluation will result in end = kernel_start
>>>> i.e. end < start, and will load initrd in the middle of the kernel.
>>>> In case (2), the evaluation will result in start = kernel_end
>>>> i.e. end < start, and will load initrd at kernel_end, that is out
>>>> of the memory region under evaluation.
>>> I agree there is a problem if the kernel range overlaps but is not fully
>>> contained in the E820 range under inspection. I'd like to ask though
>>> under what conditions that can happen, as it seems suspicious for the
>>> kernel range to span multiple E820 ranges.
>> We tried to boot Zephyr as pvh dom0 and its load address was under 1MB.
>>
>> I know ... that maybe shouldn't have been permitted at all, but
>> nevertheless we hit this issue.
> How can they expect to have a contiguous, large enough range there?
Me too I wonder. I don't know, maybe somebody else could shed
some light on this.
>>>> This patch rephrases the if condition to include the above two cases
>>>> without affecting the behaviour for the case where
>>>> start < kernel_start && end > kernel_end
>>>>
>>>> Fixes: 73b47eea2104 ('x86/dom0: improve PVH initrd and metadata placement')
>>>> Signed-off-by: Xenia Ragiadakou<xenia.ragiadakou@amd.com>
>>>> ---
>>>>    xen/arch/x86/hvm/dom0_build.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
>>>> index c7d47d0d4c..5fc2c12f3a 100644
>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>>            if ( end <= kernel_start || start >= kernel_end )
>>>>                ; /* No overlap, nothing to do. */
>>>>            /* Deal with the kernel already being loaded in the region. */
>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>> +        else if ( kernel_start + kernel_end > start + end )
>>> What meaning has the sum of the start and end of either range? I can't
>>> figure how comparing those two values will be generally correct / useful.
>>> If the partial-overlap case needs handling in the first place, I think
>>> new conditionals need adding (and the existing one needs constraining to
>>> "kernel range fully contained") to use
>>> - as before, the larger of the non-overlapping ranges at start and end
>>>     if the kernel range is fully contained,
>>> - the tail of the range when the overlap is at the start,
>>> - the head of the range when the overlap is at the end.
>> Yes it is not quite straight forward to understand and is based on the
>> assumption that end > kernel_start and start < kernel_end, due to
>> the first condition failing.
>>
>> Both cases:
>> (start < kernel_start && end < kernel_end) and
>> (kernel_start - start > end - kernel_end)
>> fall into the condition ( kernel_start + kernel_end > start + end )
>>
>> And both the cases:
>> (start > kernel_start && end > kernel_end) and
>> (end - kernel_end > kernel_start - start)
>> fall into the condition ( kernel_start + kernel_end < start + end )
>>
>> ... unless of course I miss a case
> Well, mathematically (i.e. ignoring the potential for overflow) the
> original expression and your replacement are identical anyway. But
> overflow needs to be taken into consideration, and hence there is a
> (theoretical only at this point) risk with the replacement expression
> as well. As a result I still think that ...
>
>>> That said, in the "kernel range fully contained" case it may want
>>> considering to use the tail range if it is large enough, rather than
>>> the larger of the two ranges. In fact when switching to that model, we
>>> ought to be able to get away with one less conditional, as then the
>>> "kernel range fully contained" case doesn't need treating specially.
> ... this alternative approach may want considering (provided we need
> to make a change in the first place, which I continue to be
> unconvinced of).
Hmm, I see your point regarding the overflow.
Given that start < kernel_end and end > kernel_start, this could
be resolved by changing the above condition into:
if ( kernel_end - start > end - kernel_start )

Would that work for you?

> Jan
Andrew Cooper Oct. 26, 2023, 10:31 a.m. UTC | #5
On 26/10/2023 9:34 am, Xenia Ragiadakou wrote:
> On 26/10/23 10:35, Jan Beulich wrote:
>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>> Given that start < kernel_end and end > kernel_start, the logic that
>>> determines the best placement for dom0 initrd and metadata, does not
>>> take into account the two cases below:
>>> (1) start > kernel_start && end > kernel_end
>>> (2) start < kernel_start && end < kernel_end
>>>
>>> In case (1), the evaluation will result in end = kernel_start
>>> i.e. end < start, and will load initrd in the middle of the kernel.
>>> In case (2), the evaluation will result in start = kernel_end
>>> i.e. end < start, and will load initrd at kernel_end, that is out
>>> of the memory region under evaluation.
>> I agree there is a problem if the kernel range overlaps but is not fully
>> contained in the E820 range under inspection. I'd like to ask though
>> under what conditions that can happen, as it seems suspicious for the
>> kernel range to span multiple E820 ranges.
> 
> We tried to boot Zephyr as pvh dom0 and its load address was under 1MB.
> 
> I know ... that maybe shouldn't have been permitted at all, but
> nevertheless we hit this issue.


Zephyr is linked to run at 4k.  That's what the ELF Headers say, and the
entrypoint is not position-independent.

So trying to put the binary anywhere else is going to work about as well
having the notes section misalign the pagetables by 0x20 bytes[1].

~Andrew

[1] Guess what was causing the "Zephyr doesn't boot PVH" issues.  Which
is doubly irritating because about 6h of debugging prior, I'd pointed
out that the linker was complaining about an orphaned section and that
that ought to be fixed before trying to debug further...
Jan Beulich Oct. 26, 2023, 11:41 a.m. UTC | #6
On 26.10.2023 12:31, Andrew Cooper wrote:
> On 26/10/2023 9:34 am, Xenia Ragiadakou wrote:
>> On 26/10/23 10:35, Jan Beulich wrote:
>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>> Given that start < kernel_end and end > kernel_start, the logic that
>>>> determines the best placement for dom0 initrd and metadata, does not
>>>> take into account the two cases below:
>>>> (1) start > kernel_start && end > kernel_end
>>>> (2) start < kernel_start && end < kernel_end
>>>>
>>>> In case (1), the evaluation will result in end = kernel_start
>>>> i.e. end < start, and will load initrd in the middle of the kernel.
>>>> In case (2), the evaluation will result in start = kernel_end
>>>> i.e. end < start, and will load initrd at kernel_end, that is out
>>>> of the memory region under evaluation.
>>> I agree there is a problem if the kernel range overlaps but is not fully
>>> contained in the E820 range under inspection. I'd like to ask though
>>> under what conditions that can happen, as it seems suspicious for the
>>> kernel range to span multiple E820 ranges.
>>
>> We tried to boot Zephyr as pvh dom0 and its load address was under 1MB.
>>
>> I know ... that maybe shouldn't have been permitted at all, but
>> nevertheless we hit this issue.
> 
> 
> Zephyr is linked to run at 4k.  That's what the ELF Headers say, and the
> entrypoint is not position-independent.

Very interesting. What size is their kernel? And, Xenia, can you provide
the E820 map that you were finding the collision with?

Jan

> So trying to put the binary anywhere else is going to work about as well
> having the notes section misalign the pagetables by 0x20 bytes[1].
> 
> ~Andrew
> 
> [1] Guess what was causing the "Zephyr doesn't boot PVH" issues.  Which
> is doubly irritating because about 6h of debugging prior, I'd pointed
> out that the linker was complaining about an orphaned section and that
> that ought to be fixed before trying to debug further...
Jan Beulich Oct. 26, 2023, 11:51 a.m. UTC | #7
On 26.10.2023 11:46, Xenia Ragiadakou wrote:
> On 26/10/23 11:45, Jan Beulich wrote:
>> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>>>            if ( end <= kernel_start || start >= kernel_end )
>>>>>                ; /* No overlap, nothing to do. */
>>>>>            /* Deal with the kernel already being loaded in the region. */
>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>> +        else if ( kernel_start + kernel_end > start + end )
>>>> What meaning has the sum of the start and end of either range? I can't
>>>> figure how comparing those two values will be generally correct / useful.
>>>> If the partial-overlap case needs handling in the first place, I think
>>>> new conditionals need adding (and the existing one needs constraining to
>>>> "kernel range fully contained") to use
>>>> - as before, the larger of the non-overlapping ranges at start and end
>>>>     if the kernel range is fully contained,
>>>> - the tail of the range when the overlap is at the start,
>>>> - the head of the range when the overlap is at the end.
>>> Yes it is not quite straight forward to understand and is based on the
>>> assumption that end > kernel_start and start < kernel_end, due to
>>> the first condition failing.
>>>
>>> Both cases:
>>> (start < kernel_start && end < kernel_end) and
>>> (kernel_start - start > end - kernel_end)
>>> fall into the condition ( kernel_start + kernel_end > start + end )
>>>
>>> And both the cases:
>>> (start > kernel_start && end > kernel_end) and
>>> (end - kernel_end > kernel_start - start)
>>> fall into the condition ( kernel_start + kernel_end < start + end )
>>>
>>> ... unless of course I miss a case
>> Well, mathematically (i.e. ignoring the potential for overflow) the
>> original expression and your replacement are identical anyway. But
>> overflow needs to be taken into consideration, and hence there is a
>> (theoretical only at this point) risk with the replacement expression
>> as well. As a result I still think that ...
>>
>>>> That said, in the "kernel range fully contained" case it may want
>>>> considering to use the tail range if it is large enough, rather than
>>>> the larger of the two ranges. In fact when switching to that model, we
>>>> ought to be able to get away with one less conditional, as then the
>>>> "kernel range fully contained" case doesn't need treating specially.
>> ... this alternative approach may want considering (provided we need
>> to make a change in the first place, which I continue to be
>> unconvinced of).
> Hmm, I see your point regarding the overflow.
> Given that start < kernel_end and end > kernel_start, this could
> be resolved by changing the above condition into:
> if ( kernel_end - start > end - kernel_start )
> 
> Would that work for you?

That would look quite a bit more natural, yes. But I don't think it covers
all cases: What if the E820 range is a proper sub-range of the kernel one?
If we consider kernel range crossing E820 region boundaries, we also need
to take that possibility into account, I think.

Jan
Xenia Ragiadakou Oct. 26, 2023, 12:09 p.m. UTC | #8
On 26/10/23 14:41, Jan Beulich wrote:
> On 26.10.2023 12:31, Andrew Cooper wrote:
>> On 26/10/2023 9:34 am, Xenia Ragiadakou wrote:
>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>> Given that start < kernel_end and end > kernel_start, the logic that
>>>>> determines the best placement for dom0 initrd and metadata, does not
>>>>> take into account the two cases below:
>>>>> (1) start > kernel_start && end > kernel_end
>>>>> (2) start < kernel_start && end < kernel_end
>>>>>
>>>>> In case (1), the evaluation will result in end = kernel_start
>>>>> i.e. end < start, and will load initrd in the middle of the kernel.
>>>>> In case (2), the evaluation will result in start = kernel_end
>>>>> i.e. end < start, and will load initrd at kernel_end, that is out
>>>>> of the memory region under evaluation.
>>>> I agree there is a problem if the kernel range overlaps but is not fully
>>>> contained in the E820 range under inspection. I'd like to ask though
>>>> under what conditions that can happen, as it seems suspicious for the
>>>> kernel range to span multiple E820 ranges.
>>> We tried to boot Zephyr as pvh dom0 and its load address was under 1MB.
>>>
>>> I know ... that maybe shouldn't have been permitted at all, but
>>> nevertheless we hit this issue.
>>
>> Zephyr is linked to run at 4k.  That's what the ELF Headers say, and the
>> entrypoint is not position-independent.
> Very interesting. What size is their kernel? And, Xenia, can you provide
> the E820 map that you were finding the collision with?

Sure.

Xen-e820 RAM map:

  [0000000000000000, 000000000009fbff] (usable)
  [000000000009fc00, 000000000009ffff] (reserved)
  [00000000000f0000, 00000000000fffff] (reserved)
  [0000000000100000, 000000007ffdefff] (usable)
  [000000007ffdf000, 000000007fffffff] (reserved)
  [00000000b0000000, 00000000bfffffff] (reserved)
  [00000000fed1c000, 00000000fed1ffff] (reserved)
  [00000000fffc0000, 00000000ffffffff] (reserved)
  [0000000100000000, 000000027fffffff] (usable)

(XEN) ELF: phdr: paddr=0x1000 memsz=0x8000
(XEN) ELF: phdr: paddr=0x100000 memsz=0x28a90
(XEN) ELF: phdr: paddr=0x128aa0 memsz=0x7560
(XEN) ELF: memory: 0x1000 -> 0x130000

>> So trying to put the binary anywhere else is going to work about as well
>> having the notes section misalign the pagetables by 0x20 bytes[1].
>>
>> ~Andrew
>>
>> [1] Guess what was causing the "Zephyr doesn't boot PVH" issues.  Which
>> is doubly irritating because about 6h of debugging prior, I'd pointed
>> out that the linker was complaining about an orphaned section and that
>> that ought to be fixed before trying to debug further...
Xenia Ragiadakou Oct. 26, 2023, 12:35 p.m. UTC | #9
On 26/10/23 14:51, Jan Beulich wrote:
> On 26.10.2023 11:46, Xenia Ragiadakou wrote:
>> On 26/10/23 11:45, Jan Beulich wrote:
>>> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
>>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>>>>             if ( end <= kernel_start || start >= kernel_end )
>>>>>>                 ; /* No overlap, nothing to do. */
>>>>>>             /* Deal with the kernel already being loaded in the region. */
>>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>>> +        else if ( kernel_start + kernel_end > start + end )
>>>>> What meaning has the sum of the start and end of either range? I can't
>>>>> figure how comparing those two values will be generally correct / useful.
>>>>> If the partial-overlap case needs handling in the first place, I think
>>>>> new conditionals need adding (and the existing one needs constraining to
>>>>> "kernel range fully contained") to use
>>>>> - as before, the larger of the non-overlapping ranges at start and end
>>>>>      if the kernel range is fully contained,
>>>>> - the tail of the range when the overlap is at the start,
>>>>> - the head of the range when the overlap is at the end.
>>>> Yes it is not quite straight forward to understand and is based on the
>>>> assumption that end > kernel_start and start < kernel_end, due to
>>>> the first condition failing.
>>>>
>>>> Both cases:
>>>> (start < kernel_start && end < kernel_end) and
>>>> (kernel_start - start > end - kernel_end)
>>>> fall into the condition ( kernel_start + kernel_end > start + end )
>>>>
>>>> And both the cases:
>>>> (start > kernel_start && end > kernel_end) and
>>>> (end - kernel_end > kernel_start - start)
>>>> fall into the condition ( kernel_start + kernel_end < start + end )
>>>>
>>>> ... unless of course I miss a case
>>> Well, mathematically (i.e. ignoring the potential for overflow) the
>>> original expression and your replacement are identical anyway. But
>>> overflow needs to be taken into consideration, and hence there is a
>>> (theoretical only at this point) risk with the replacement expression
>>> as well. As a result I still think that ...
>>>
>>>>> That said, in the "kernel range fully contained" case it may want
>>>>> considering to use the tail range if it is large enough, rather than
>>>>> the larger of the two ranges. In fact when switching to that model, we
>>>>> ought to be able to get away with one less conditional, as then the
>>>>> "kernel range fully contained" case doesn't need treating specially.
>>> ... this alternative approach may want considering (provided we need
>>> to make a change in the first place, which I continue to be
>>> unconvinced of).
>> Hmm, I see your point regarding the overflow.
>> Given that start < kernel_end and end > kernel_start, this could
>> be resolved by changing the above condition into:
>> if ( kernel_end - start > end - kernel_start )
>>
>> Would that work for you?
> 
> That would look quite a bit more natural, yes. But I don't think it covers
> all cases: What if the E820 range is a proper sub-range of the kernel one?
> If we consider kernel range crossing E820 region boundaries, we also need
> to take that possibility into account, I think.

You are right, this case is not handled and can lead to either of the 
issues mentioned in commit message.
Maybe we should check whether end > start before proceeding with 
checking the size.

> 
> Jan
Jan Beulich Oct. 26, 2023, 12:35 p.m. UTC | #10
On 26.10.2023 14:09, Xenia Ragiadakou wrote:
> On 26/10/23 14:41, Jan Beulich wrote:
>> On 26.10.2023 12:31, Andrew Cooper wrote:
>>> On 26/10/2023 9:34 am, Xenia Ragiadakou wrote:
>>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>>> Given that start < kernel_end and end > kernel_start, the logic that
>>>>>> determines the best placement for dom0 initrd and metadata, does not
>>>>>> take into account the two cases below:
>>>>>> (1) start > kernel_start && end > kernel_end
>>>>>> (2) start < kernel_start && end < kernel_end
>>>>>>
>>>>>> In case (1), the evaluation will result in end = kernel_start
>>>>>> i.e. end < start, and will load initrd in the middle of the kernel.
>>>>>> In case (2), the evaluation will result in start = kernel_end
>>>>>> i.e. end < start, and will load initrd at kernel_end, that is out
>>>>>> of the memory region under evaluation.
>>>>> I agree there is a problem if the kernel range overlaps but is not fully
>>>>> contained in the E820 range under inspection. I'd like to ask though
>>>>> under what conditions that can happen, as it seems suspicious for the
>>>>> kernel range to span multiple E820 ranges.
>>>> We tried to boot Zephyr as pvh dom0 and its load address was under 1MB.
>>>>
>>>> I know ... that maybe shouldn't have been permitted at all, but
>>>> nevertheless we hit this issue.
>>>
>>> Zephyr is linked to run at 4k.  That's what the ELF Headers say, and the
>>> entrypoint is not position-independent.
>> Very interesting. What size is their kernel? And, Xenia, can you provide
>> the E820 map that you were finding the collision with?
> 
> Sure.
> 
> Xen-e820 RAM map:
> 
>   [0000000000000000, 000000000009fbff] (usable)
>   [000000000009fc00, 000000000009ffff] (reserved)
>   [00000000000f0000, 00000000000fffff] (reserved)
>   [0000000000100000, 000000007ffdefff] (usable)
>   [000000007ffdf000, 000000007fffffff] (reserved)
>   [00000000b0000000, 00000000bfffffff] (reserved)
>   [00000000fed1c000, 00000000fed1ffff] (reserved)
>   [00000000fffc0000, 00000000ffffffff] (reserved)
>   [0000000100000000, 000000027fffffff] (usable)
> 
> (XEN) ELF: phdr: paddr=0x1000 memsz=0x8000
> (XEN) ELF: phdr: paddr=0x100000 memsz=0x28a90
> (XEN) ELF: phdr: paddr=0x128aa0 memsz=0x7560
> (XEN) ELF: memory: 0x1000 -> 0x130000

Oh, so it's not any particular range that crosses any E820 boundaries,
but merely the total range including all holes which does. That
raises the (only somewhat related) question what we would do with a
kernel having a really large hole somewhere.

Jan
Jan Beulich Oct. 26, 2023, 12:37 p.m. UTC | #11
On 26.10.2023 14:35, Xenia Ragiadakou wrote:
> 
> 
> On 26/10/23 14:51, Jan Beulich wrote:
>> On 26.10.2023 11:46, Xenia Ragiadakou wrote:
>>> On 26/10/23 11:45, Jan Beulich wrote:
>>>> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
>>>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>>>>>             if ( end <= kernel_start || start >= kernel_end )
>>>>>>>                 ; /* No overlap, nothing to do. */
>>>>>>>             /* Deal with the kernel already being loaded in the region. */
>>>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>>>> +        else if ( kernel_start + kernel_end > start + end )
>>>>>> What meaning has the sum of the start and end of either range? I can't
>>>>>> figure how comparing those two values will be generally correct / useful.
>>>>>> If the partial-overlap case needs handling in the first place, I think
>>>>>> new conditionals need adding (and the existing one needs constraining to
>>>>>> "kernel range fully contained") to use
>>>>>> - as before, the larger of the non-overlapping ranges at start and end
>>>>>>      if the kernel range is fully contained,
>>>>>> - the tail of the range when the overlap is at the start,
>>>>>> - the head of the range when the overlap is at the end.
>>>>> Yes it is not quite straight forward to understand and is based on the
>>>>> assumption that end > kernel_start and start < kernel_end, due to
>>>>> the first condition failing.
>>>>>
>>>>> Both cases:
>>>>> (start < kernel_start && end < kernel_end) and
>>>>> (kernel_start - start > end - kernel_end)
>>>>> fall into the condition ( kernel_start + kernel_end > start + end )
>>>>>
>>>>> And both the cases:
>>>>> (start > kernel_start && end > kernel_end) and
>>>>> (end - kernel_end > kernel_start - start)
>>>>> fall into the condition ( kernel_start + kernel_end < start + end )
>>>>>
>>>>> ... unless of course I miss a case
>>>> Well, mathematically (i.e. ignoring the potential for overflow) the
>>>> original expression and your replacement are identical anyway. But
>>>> overflow needs to be taken into consideration, and hence there is a
>>>> (theoretical only at this point) risk with the replacement expression
>>>> as well. As a result I still think that ...
>>>>
>>>>>> That said, in the "kernel range fully contained" case it may want
>>>>>> considering to use the tail range if it is large enough, rather than
>>>>>> the larger of the two ranges. In fact when switching to that model, we
>>>>>> ought to be able to get away with one less conditional, as then the
>>>>>> "kernel range fully contained" case doesn't need treating specially.
>>>> ... this alternative approach may want considering (provided we need
>>>> to make a change in the first place, which I continue to be
>>>> unconvinced of).
>>> Hmm, I see your point regarding the overflow.
>>> Given that start < kernel_end and end > kernel_start, this could
>>> be resolved by changing the above condition into:
>>> if ( kernel_end - start > end - kernel_start )
>>>
>>> Would that work for you?
>>
>> That would look quite a bit more natural, yes. But I don't think it covers
>> all cases: What if the E820 range is a proper sub-range of the kernel one?
>> If we consider kernel range crossing E820 region boundaries, we also need
>> to take that possibility into account, I think.
> 
> You are right, this case is not handled and can lead to either of the 
> issues mentioned in commit message.
> Maybe we should check whether end > start before proceeding with 
> checking the size.

It looks like it all boils down to the alternative I did sketch out.

Jan
Xenia Ragiadakou Oct. 26, 2023, 1:58 p.m. UTC | #12
On 26/10/23 15:37, Jan Beulich wrote:
> On 26.10.2023 14:35, Xenia Ragiadakou wrote:
>>
>>
>> On 26/10/23 14:51, Jan Beulich wrote:
>>> On 26.10.2023 11:46, Xenia Ragiadakou wrote:
>>>> On 26/10/23 11:45, Jan Beulich wrote:
>>>>> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
>>>>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>>>>>>              if ( end <= kernel_start || start >= kernel_end )
>>>>>>>>                  ; /* No overlap, nothing to do. */
>>>>>>>>              /* Deal with the kernel already being loaded in the region. */
>>>>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>>>>> +        else if ( kernel_start + kernel_end > start + end )
>>>>>>> What meaning has the sum of the start and end of either range? I can't
>>>>>>> figure how comparing those two values will be generally correct / useful.
>>>>>>> If the partial-overlap case needs handling in the first place, I think
>>>>>>> new conditionals need adding (and the existing one needs constraining to
>>>>>>> "kernel range fully contained") to use
>>>>>>> - as before, the larger of the non-overlapping ranges at start and end
>>>>>>>       if the kernel range is fully contained,
>>>>>>> - the tail of the range when the overlap is at the start,
>>>>>>> - the head of the range when the overlap is at the end.
>>>>>> Yes it is not quite straight forward to understand and is based on the
>>>>>> assumption that end > kernel_start and start < kernel_end, due to
>>>>>> the first condition failing.
>>>>>>
>>>>>> Both cases:
>>>>>> (start < kernel_start && end < kernel_end) and
>>>>>> (kernel_start - start > end - kernel_end)
>>>>>> fall into the condition ( kernel_start + kernel_end > start + end )
>>>>>>
>>>>>> And both the cases:
>>>>>> (start > kernel_start && end > kernel_end) and
>>>>>> (end - kernel_end > kernel_start - start)
>>>>>> fall into the condition ( kernel_start + kernel_end < start + end )
>>>>>>
>>>>>> ... unless of course I miss a case
>>>>> Well, mathematically (i.e. ignoring the potential for overflow) the
>>>>> original expression and your replacement are identical anyway. But
>>>>> overflow needs to be taken into consideration, and hence there is a
>>>>> (theoretical only at this point) risk with the replacement expression
>>>>> as well. As a result I still think that ...
>>>>>
>>>>>>> That said, in the "kernel range fully contained" case it may want
>>>>>>> considering to use the tail range if it is large enough, rather than
>>>>>>> the larger of the two ranges. In fact when switching to that model, we
>>>>>>> ought to be able to get away with one less conditional, as then the
>>>>>>> "kernel range fully contained" case doesn't need treating specially.
>>>>> ... this alternative approach may want considering (provided we need
>>>>> to make a change in the first place, which I continue to be
>>>>> unconvinced of).
>>>> Hmm, I see your point regarding the overflow.
>>>> Given that start < kernel_end and end > kernel_start, this could
>>>> be resolved by changing the above condition into:
>>>> if ( kernel_end - start > end - kernel_start )
>>>>
>>>> Would that work for you?
>>>
>>> That would look quite a bit more natural, yes. But I don't think it covers
>>> all cases: What if the E820 range is a proper sub-range of the kernel one?
>>> If we consider kernel range crossing E820 region boundaries, we also need
>>> to take that possibility into account, I think.
>>
>> You are right, this case is not handled and can lead to either of the
>> issues mentioned in commit message.
>> Maybe we should check whether end > start before proceeding with
>> checking the size.
> 
> It looks like it all boils down to the alternative I did sketch out.

I 'm not sure I fully understood the alternative.
Do you mean sth in the lines below?

          if ( end <= kernel_start || start >= kernel_end )
              ; /* No overlap, nothing to do. */
          /* Deal with the kernel already being loaded in the region. */
-        else if ( kernel_start - start > end - kernel_end )
+        else if ( start < kernel_start && end > kernel_end ) {
+            if ( kernel_start - start > end - kernel_end )
+                end = kernel_start;
+            else
+                start = kernel_end;
+        }
+        else if ( start < kernel_start )
              end = kernel_start;
-        else
+        else if ( end > kernel_end )
              start = kernel_end;
+        else
+            continue;

          if ( end - start >= size )
              return start;

You wouldn't like to consider this approach?

          if ( end <= kernel_start || start >= kernel_end )
              ; /* No overlap, nothing to do. */
          /* Deal with the kernel already being loaded in the region. */
-        else if ( kernel_start - start > end - kernel_end )
+        else if ( kernel_end - start > end - kernel_start )
              end = kernel_start;
          else
              start = kernel_end;

-        if ( end - start >= size )
+        if ( end > start && end - start >= size )
              return start;
      }

> 
> Jan
Jan Beulich Oct. 26, 2023, 2:55 p.m. UTC | #13
On 26.10.2023 15:58, Xenia Ragiadakou wrote:
> 
> On 26/10/23 15:37, Jan Beulich wrote:
>> On 26.10.2023 14:35, Xenia Ragiadakou wrote:
>>>
>>>
>>> On 26/10/23 14:51, Jan Beulich wrote:
>>>> On 26.10.2023 11:46, Xenia Ragiadakou wrote:
>>>>> On 26/10/23 11:45, Jan Beulich wrote:
>>>>>> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
>>>>>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>>>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>>>>>>>              if ( end <= kernel_start || start >= kernel_end )
>>>>>>>>>                  ; /* No overlap, nothing to do. */
>>>>>>>>>              /* Deal with the kernel already being loaded in the region. */
>>>>>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>>>>>> +        else if ( kernel_start + kernel_end > start + end )
>>>>>>>> What meaning has the sum of the start and end of either range? I can't
>>>>>>>> figure how comparing those two values will be generally correct / useful.
>>>>>>>> If the partial-overlap case needs handling in the first place, I think
>>>>>>>> new conditionals need adding (and the existing one needs constraining to
>>>>>>>> "kernel range fully contained") to use
>>>>>>>> - as before, the larger of the non-overlapping ranges at start and end
>>>>>>>>       if the kernel range is fully contained,
>>>>>>>> - the tail of the range when the overlap is at the start,
>>>>>>>> - the head of the range when the overlap is at the end.
>>>>>>> Yes it is not quite straight forward to understand and is based on the
>>>>>>> assumption that end > kernel_start and start < kernel_end, due to
>>>>>>> the first condition failing.
>>>>>>>
>>>>>>> Both cases:
>>>>>>> (start < kernel_start && end < kernel_end) and
>>>>>>> (kernel_start - start > end - kernel_end)
>>>>>>> fall into the condition ( kernel_start + kernel_end > start + end )
>>>>>>>
>>>>>>> And both the cases:
>>>>>>> (start > kernel_start && end > kernel_end) and
>>>>>>> (end - kernel_end > kernel_start - start)
>>>>>>> fall into the condition ( kernel_start + kernel_end < start + end )
>>>>>>>
>>>>>>> ... unless of course I miss a case
>>>>>> Well, mathematically (i.e. ignoring the potential for overflow) the
>>>>>> original expression and your replacement are identical anyway. But
>>>>>> overflow needs to be taken into consideration, and hence there is a
>>>>>> (theoretical only at this point) risk with the replacement expression
>>>>>> as well. As a result I still think that ...
>>>>>>
>>>>>>>> That said, in the "kernel range fully contained" case it may want
>>>>>>>> considering to use the tail range if it is large enough, rather than
>>>>>>>> the larger of the two ranges. In fact when switching to that model, we
>>>>>>>> ought to be able to get away with one less conditional, as then the
>>>>>>>> "kernel range fully contained" case doesn't need treating specially.
>>>>>> ... this alternative approach may want considering (provided we need
>>>>>> to make a change in the first place, which I continue to be
>>>>>> unconvinced of).
>>>>> Hmm, I see your point regarding the overflow.
>>>>> Given that start < kernel_end and end > kernel_start, this could
>>>>> be resolved by changing the above condition into:
>>>>> if ( kernel_end - start > end - kernel_start )
>>>>>
>>>>> Would that work for you?
>>>>
>>>> That would look quite a bit more natural, yes. But I don't think it covers
>>>> all cases: What if the E820 range is a proper sub-range of the kernel one?
>>>> If we consider kernel range crossing E820 region boundaries, we also need
>>>> to take that possibility into account, I think.
>>>
>>> You are right, this case is not handled and can lead to either of the
>>> issues mentioned in commit message.
>>> Maybe we should check whether end > start before proceeding with
>>> checking the size.
>>
>> It looks like it all boils down to the alternative I did sketch out.
> 
> I 'm not sure I fully understood the alternative.
> Do you mean sth in the lines below?
> 
>           if ( end <= kernel_start || start >= kernel_end )
>               ; /* No overlap, nothing to do. */
>           /* Deal with the kernel already being loaded in the region. */
> -        else if ( kernel_start - start > end - kernel_end )
> +        else if ( start < kernel_start && end > kernel_end ) {
> +            if ( kernel_start - start > end - kernel_end )
> +                end = kernel_start;
> +            else
> +                start = kernel_end;
> +        }
> +        else if ( start < kernel_start )
>               end = kernel_start;
> -        else
> +        else if ( end > kernel_end )
>               start = kernel_end;
> +        else
> +            continue;
> 
>           if ( end - start >= size )
>               return start;

Not exactly, no, because this still takes the size into account only
in this final if().

> You wouldn't like to consider this approach?

I'm happy to consider any other approach. Just that ...

>           if ( end <= kernel_start || start >= kernel_end )
>               ; /* No overlap, nothing to do. */
>           /* Deal with the kernel already being loaded in the region. */
> -        else if ( kernel_start - start > end - kernel_end )
> +        else if ( kernel_end - start > end - kernel_start )
>               end = kernel_start;
>           else
>               start = kernel_end;
> 
> -        if ( end - start >= size )
> +        if ( end > start && end - start >= size )
>               return start;
>       }

... I'm afraid this doesn't deal well with the specific case I was
mentioning: If the E820 region is fully contained in the kernel range,
it looks to me as if this approach would ignore the E820 altogether,
since you either move end ahead of start or start past end then. Both
head and tail regions may be large enough in this case, and if this
was the only region above 1M, there'd be no other space to fall back
to.

Jan
Roger Pau Monné Oct. 26, 2023, 2:58 p.m. UTC | #14
On Thu, Oct 26, 2023 at 03:09:04PM +0300, Xenia Ragiadakou wrote:
> On 26/10/23 14:41, Jan Beulich wrote:
> > On 26.10.2023 12:31, Andrew Cooper wrote:
> > > On 26/10/2023 9:34 am, Xenia Ragiadakou wrote:
> > > > On 26/10/23 10:35, Jan Beulich wrote:
> > > > > On 26.10.2023 08:45, Xenia Ragiadakou wrote:
> > > > > > Given that start < kernel_end and end > kernel_start, the logic that
> > > > > > determines the best placement for dom0 initrd and metadata, does not
> > > > > > take into account the two cases below:
> > > > > > (1) start > kernel_start && end > kernel_end
> > > > > > (2) start < kernel_start && end < kernel_end
> > > > > > 
> > > > > > In case (1), the evaluation will result in end = kernel_start
> > > > > > i.e. end < start, and will load initrd in the middle of the kernel.
> > > > > > In case (2), the evaluation will result in start = kernel_end
> > > > > > i.e. end < start, and will load initrd at kernel_end, that is out
> > > > > > of the memory region under evaluation.
> > > > > I agree there is a problem if the kernel range overlaps but is not fully
> > > > > contained in the E820 range under inspection. I'd like to ask though
> > > > > under what conditions that can happen, as it seems suspicious for the
> > > > > kernel range to span multiple E820 ranges.
> > > > We tried to boot Zephyr as pvh dom0 and its load address was under 1MB.
> > > > 
> > > > I know ... that maybe shouldn't have been permitted at all, but
> > > > nevertheless we hit this issue.
> > > 
> > > Zephyr is linked to run at 4k.  That's what the ELF Headers say, and the
> > > entrypoint is not position-independent.
> > Very interesting. What size is their kernel? And, Xenia, can you provide
> > the E820 map that you were finding the collision with?
> 
> Sure.
> 
> Xen-e820 RAM map:
> 
>  [0000000000000000, 000000000009fbff] (usable)
>  [000000000009fc00, 000000000009ffff] (reserved)
>  [00000000000f0000, 00000000000fffff] (reserved)
>  [0000000000100000, 000000007ffdefff] (usable)
>  [000000007ffdf000, 000000007fffffff] (reserved)
>  [00000000b0000000, 00000000bfffffff] (reserved)
>  [00000000fed1c000, 00000000fed1ffff] (reserved)
>  [00000000fffc0000, 00000000ffffffff] (reserved)
>  [0000000100000000, 000000027fffffff] (usable)
> 
> (XEN) ELF: phdr: paddr=0x1000 memsz=0x8000
> (XEN) ELF: phdr: paddr=0x100000 memsz=0x28a90
> (XEN) ELF: phdr: paddr=0x128aa0 memsz=0x7560
> (XEN) ELF: memory: 0x1000 -> 0x130000

Interesting, so far we have accommodated for the program headers
containing physical addresses for a mostly contiguous region, and the
assumption was that it would all fit into a single RAM region.

If we have to support elfs with such scattered loaded regions we
should start using a rangeset or similar in find_memory() in order to
have a clear picture of the available memory ranges suitable to load
the kernel metadata.

Thanks, Roger.
Roger Pau Monné Oct. 26, 2023, 3:01 p.m. UTC | #15
On Thu, Oct 26, 2023 at 04:55:36PM +0200, Jan Beulich wrote:
> On 26.10.2023 15:58, Xenia Ragiadakou wrote:
> > 
> > On 26/10/23 15:37, Jan Beulich wrote:
> >> On 26.10.2023 14:35, Xenia Ragiadakou wrote:
> >>>
> >>>
> >>> On 26/10/23 14:51, Jan Beulich wrote:
> >>>> On 26.10.2023 11:46, Xenia Ragiadakou wrote:
> >>>>> On 26/10/23 11:45, Jan Beulich wrote:
> >>>>>> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
> >>>>>>> On 26/10/23 10:35, Jan Beulich wrote:
> >>>>>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
> >>>>>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
> >>>>>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
> >>>>>>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
> >>>>>>>>>              if ( end <= kernel_start || start >= kernel_end )
> >>>>>>>>>                  ; /* No overlap, nothing to do. */
> >>>>>>>>>              /* Deal with the kernel already being loaded in the region. */
> >>>>>>>>> -        else if ( kernel_start - start > end - kernel_end )
> >>>>>>>>> +        else if ( kernel_start + kernel_end > start + end )
> >>>>>>>> What meaning has the sum of the start and end of either range? I can't
> >>>>>>>> figure how comparing those two values will be generally correct / useful.
> >>>>>>>> If the partial-overlap case needs handling in the first place, I think
> >>>>>>>> new conditionals need adding (and the existing one needs constraining to
> >>>>>>>> "kernel range fully contained") to use
> >>>>>>>> - as before, the larger of the non-overlapping ranges at start and end
> >>>>>>>>       if the kernel range is fully contained,
> >>>>>>>> - the tail of the range when the overlap is at the start,
> >>>>>>>> - the head of the range when the overlap is at the end.
> >>>>>>> Yes it is not quite straight forward to understand and is based on the
> >>>>>>> assumption that end > kernel_start and start < kernel_end, due to
> >>>>>>> the first condition failing.
> >>>>>>>
> >>>>>>> Both cases:
> >>>>>>> (start < kernel_start && end < kernel_end) and
> >>>>>>> (kernel_start - start > end - kernel_end)
> >>>>>>> fall into the condition ( kernel_start + kernel_end > start + end )
> >>>>>>>
> >>>>>>> And both the cases:
> >>>>>>> (start > kernel_start && end > kernel_end) and
> >>>>>>> (end - kernel_end > kernel_start - start)
> >>>>>>> fall into the condition ( kernel_start + kernel_end < start + end )
> >>>>>>>
> >>>>>>> ... unless of course I miss a case
> >>>>>> Well, mathematically (i.e. ignoring the potential for overflow) the
> >>>>>> original expression and your replacement are identical anyway. But
> >>>>>> overflow needs to be taken into consideration, and hence there is a
> >>>>>> (theoretical only at this point) risk with the replacement expression
> >>>>>> as well. As a result I still think that ...
> >>>>>>
> >>>>>>>> That said, in the "kernel range fully contained" case it may want
> >>>>>>>> considering to use the tail range if it is large enough, rather than
> >>>>>>>> the larger of the two ranges. In fact when switching to that model, we
> >>>>>>>> ought to be able to get away with one less conditional, as then the
> >>>>>>>> "kernel range fully contained" case doesn't need treating specially.
> >>>>>> ... this alternative approach may want considering (provided we need
> >>>>>> to make a change in the first place, which I continue to be
> >>>>>> unconvinced of).
> >>>>> Hmm, I see your point regarding the overflow.
> >>>>> Given that start < kernel_end and end > kernel_start, this could
> >>>>> be resolved by changing the above condition into:
> >>>>> if ( kernel_end - start > end - kernel_start )
> >>>>>
> >>>>> Would that work for you?
> >>>>
> >>>> That would look quite a bit more natural, yes. But I don't think it covers
> >>>> all cases: What if the E820 range is a proper sub-range of the kernel one?
> >>>> If we consider kernel range crossing E820 region boundaries, we also need
> >>>> to take that possibility into account, I think.
> >>>
> >>> You are right, this case is not handled and can lead to either of the
> >>> issues mentioned in commit message.
> >>> Maybe we should check whether end > start before proceeding with
> >>> checking the size.
> >>
> >> It looks like it all boils down to the alternative I did sketch out.
> > 
> > I 'm not sure I fully understood the alternative.
> > Do you mean sth in the lines below?
> > 
> >           if ( end <= kernel_start || start >= kernel_end )
> >               ; /* No overlap, nothing to do. */
> >           /* Deal with the kernel already being loaded in the region. */
> > -        else if ( kernel_start - start > end - kernel_end )
> > +        else if ( start < kernel_start && end > kernel_end ) {
> > +            if ( kernel_start - start > end - kernel_end )
> > +                end = kernel_start;
> > +            else
> > +                start = kernel_end;
> > +        }
> > +        else if ( start < kernel_start )
> >               end = kernel_start;
> > -        else
> > +        else if ( end > kernel_end )
> >               start = kernel_end;
> > +        else
> > +            continue;
> > 
> >           if ( end - start >= size )
> >               return start;
> 
> Not exactly, no, because this still takes the size into account only
> in this final if().
> 
> > You wouldn't like to consider this approach?
> 
> I'm happy to consider any other approach. Just that ...
> 
> >           if ( end <= kernel_start || start >= kernel_end )
> >               ; /* No overlap, nothing to do. */
> >           /* Deal with the kernel already being loaded in the region. */
> > -        else if ( kernel_start - start > end - kernel_end )
> > +        else if ( kernel_end - start > end - kernel_start )
> >               end = kernel_start;
> >           else
> >               start = kernel_end;
> > 
> > -        if ( end - start >= size )
> > +        if ( end > start && end - start >= size )
> >               return start;
> >       }
> 
> ... I'm afraid this doesn't deal well with the specific case I was
> mentioning: If the E820 region is fully contained in the kernel range,
> it looks to me as if this approach would ignore the E820 altogether,
> since you either move end ahead of start or start past end then. Both
> head and tail regions may be large enough in this case, and if this
> was the only region above 1M, there'd be no other space to fall back
> to.

I think the only sane option and more robust if we have to start
supporting kernels with a set of scattered program headers physical
addresses is to populate a rangeset with all the RAM ranges,
subtract all memory used by the kernel and then find a suitable
region for the metadata.

Roger.
Xenia Ragiadakou Oct. 26, 2023, 4:55 p.m. UTC | #16
On 26/10/23 17:55, Jan Beulich wrote:
> On 26.10.2023 15:58, Xenia Ragiadakou wrote:
>>
>> On 26/10/23 15:37, Jan Beulich wrote:
>>> On 26.10.2023 14:35, Xenia Ragiadakou wrote:
>>>>
>>>>
>>>> On 26/10/23 14:51, Jan Beulich wrote:
>>>>> On 26.10.2023 11:46, Xenia Ragiadakou wrote:
>>>>>> On 26/10/23 11:45, Jan Beulich wrote:
>>>>>>> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
>>>>>>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>>>>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>>>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>>>>>>>>               if ( end <= kernel_start || start >= kernel_end )
>>>>>>>>>>                   ; /* No overlap, nothing to do. */
>>>>>>>>>>               /* Deal with the kernel already being loaded in the region. */
>>>>>>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>>>>>>> +        else if ( kernel_start + kernel_end > start + end )
>>>>>>>>> What meaning has the sum of the start and end of either range? I can't
>>>>>>>>> figure how comparing those two values will be generally correct / useful.
>>>>>>>>> If the partial-overlap case needs handling in the first place, I think
>>>>>>>>> new conditionals need adding (and the existing one needs constraining to
>>>>>>>>> "kernel range fully contained") to use
>>>>>>>>> - as before, the larger of the non-overlapping ranges at start and end
>>>>>>>>>        if the kernel range is fully contained,
>>>>>>>>> - the tail of the range when the overlap is at the start,
>>>>>>>>> - the head of the range when the overlap is at the end.
>>>>>>>> Yes it is not quite straight forward to understand and is based on the
>>>>>>>> assumption that end > kernel_start and start < kernel_end, due to
>>>>>>>> the first condition failing.
>>>>>>>>
>>>>>>>> Both cases:
>>>>>>>> (start < kernel_start && end < kernel_end) and
>>>>>>>> (kernel_start - start > end - kernel_end)
>>>>>>>> fall into the condition ( kernel_start + kernel_end > start + end )
>>>>>>>>
>>>>>>>> And both the cases:
>>>>>>>> (start > kernel_start && end > kernel_end) and
>>>>>>>> (end - kernel_end > kernel_start - start)
>>>>>>>> fall into the condition ( kernel_start + kernel_end < start + end )
>>>>>>>>
>>>>>>>> ... unless of course I miss a case
>>>>>>> Well, mathematically (i.e. ignoring the potential for overflow) the
>>>>>>> original expression and your replacement are identical anyway. But
>>>>>>> overflow needs to be taken into consideration, and hence there is a
>>>>>>> (theoretical only at this point) risk with the replacement expression
>>>>>>> as well. As a result I still think that ...
>>>>>>>
>>>>>>>>> That said, in the "kernel range fully contained" case it may want
>>>>>>>>> considering to use the tail range if it is large enough, rather than
>>>>>>>>> the larger of the two ranges. In fact when switching to that model, we
>>>>>>>>> ought to be able to get away with one less conditional, as then the
>>>>>>>>> "kernel range fully contained" case doesn't need treating specially.
>>>>>>> ... this alternative approach may want considering (provided we need
>>>>>>> to make a change in the first place, which I continue to be
>>>>>>> unconvinced of).
>>>>>> Hmm, I see your point regarding the overflow.
>>>>>> Given that start < kernel_end and end > kernel_start, this could
>>>>>> be resolved by changing the above condition into:
>>>>>> if ( kernel_end - start > end - kernel_start )
>>>>>>
>>>>>> Would that work for you?
>>>>>
>>>>> That would look quite a bit more natural, yes. But I don't think it covers
>>>>> all cases: What if the E820 range is a proper sub-range of the kernel one?
>>>>> If we consider kernel range crossing E820 region boundaries, we also need
>>>>> to take that possibility into account, I think.
>>>>
>>>> You are right, this case is not handled and can lead to either of the
>>>> issues mentioned in commit message.
>>>> Maybe we should check whether end > start before proceeding with
>>>> checking the size.
>>>
>>> It looks like it all boils down to the alternative I did sketch out.
>>
>> I 'm not sure I fully understood the alternative.
>> Do you mean sth in the lines below?
>>
>>            if ( end <= kernel_start || start >= kernel_end )
>>                ; /* No overlap, nothing to do. */
>>            /* Deal with the kernel already being loaded in the region. */
>> -        else if ( kernel_start - start > end - kernel_end )
>> +        else if ( start < kernel_start && end > kernel_end ) {
>> +            if ( kernel_start - start > end - kernel_end )
>> +                end = kernel_start;
>> +            else
>> +                start = kernel_end;
>> +        }
>> +        else if ( start < kernel_start )
>>                end = kernel_start;
>> -        else
>> +        else if ( end > kernel_end )
>>                start = kernel_end;
>> +        else
>> +            continue;
>>
>>            if ( end - start >= size )
>>                return start;
> 
> Not exactly, no, because this still takes the size into account only
> in this final if().
> 
>> You wouldn't like to consider this approach?
> 
> I'm happy to consider any other approach. Just that ...
> 
>>            if ( end <= kernel_start || start >= kernel_end )
>>                ; /* No overlap, nothing to do. */
>>            /* Deal with the kernel already being loaded in the region. */
>> -        else if ( kernel_start - start > end - kernel_end )
>> +        else if ( kernel_end - start > end - kernel_start )
>>                end = kernel_start;
>>            else
>>                start = kernel_end;
>>
>> -        if ( end - start >= size )
>> +        if ( end > start && end - start >= size )
>>                return start;
>>        }
> 
> ... I'm afraid this doesn't deal well with the specific case I was
> mentioning: If the E820 region is fully contained in the kernel range,
> it looks to me as if this approach would ignore the E820 altogether,
> since you either move end ahead of start or start past end then. Both
> head and tail regions may be large enough in this case, and if this
> was the only region above 1M, there'd be no other space to fall back
> to.

Yes, in which case it will fail. This is legitimate.
Currently, the code proceeds with the dom0 kernel being corrupted.

> 
> Jan
Jan Beulich Oct. 27, 2023, 6:37 a.m. UTC | #17
On 26.10.2023 18:55, Xenia Ragiadakou wrote:
> 
> 
> On 26/10/23 17:55, Jan Beulich wrote:
>> On 26.10.2023 15:58, Xenia Ragiadakou wrote:
>>>
>>> On 26/10/23 15:37, Jan Beulich wrote:
>>>> On 26.10.2023 14:35, Xenia Ragiadakou wrote:
>>>>>
>>>>>
>>>>> On 26/10/23 14:51, Jan Beulich wrote:
>>>>>> On 26.10.2023 11:46, Xenia Ragiadakou wrote:
>>>>>>> On 26/10/23 11:45, Jan Beulich wrote:
>>>>>>>> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
>>>>>>>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>>>>>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>>>>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>>>>>>>>>               if ( end <= kernel_start || start >= kernel_end )
>>>>>>>>>>>                   ; /* No overlap, nothing to do. */
>>>>>>>>>>>               /* Deal with the kernel already being loaded in the region. */
>>>>>>>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>>>>>>>> +        else if ( kernel_start + kernel_end > start + end )
>>>>>>>>>> What meaning has the sum of the start and end of either range? I can't
>>>>>>>>>> figure how comparing those two values will be generally correct / useful.
>>>>>>>>>> If the partial-overlap case needs handling in the first place, I think
>>>>>>>>>> new conditionals need adding (and the existing one needs constraining to
>>>>>>>>>> "kernel range fully contained") to use
>>>>>>>>>> - as before, the larger of the non-overlapping ranges at start and end
>>>>>>>>>>        if the kernel range is fully contained,
>>>>>>>>>> - the tail of the range when the overlap is at the start,
>>>>>>>>>> - the head of the range when the overlap is at the end.
>>>>>>>>> Yes it is not quite straight forward to understand and is based on the
>>>>>>>>> assumption that end > kernel_start and start < kernel_end, due to
>>>>>>>>> the first condition failing.
>>>>>>>>>
>>>>>>>>> Both cases:
>>>>>>>>> (start < kernel_start && end < kernel_end) and
>>>>>>>>> (kernel_start - start > end - kernel_end)
>>>>>>>>> fall into the condition ( kernel_start + kernel_end > start + end )
>>>>>>>>>
>>>>>>>>> And both the cases:
>>>>>>>>> (start > kernel_start && end > kernel_end) and
>>>>>>>>> (end - kernel_end > kernel_start - start)
>>>>>>>>> fall into the condition ( kernel_start + kernel_end < start + end )
>>>>>>>>>
>>>>>>>>> ... unless of course I miss a case
>>>>>>>> Well, mathematically (i.e. ignoring the potential for overflow) the
>>>>>>>> original expression and your replacement are identical anyway. But
>>>>>>>> overflow needs to be taken into consideration, and hence there is a
>>>>>>>> (theoretical only at this point) risk with the replacement expression
>>>>>>>> as well. As a result I still think that ...
>>>>>>>>
>>>>>>>>>> That said, in the "kernel range fully contained" case it may want
>>>>>>>>>> considering to use the tail range if it is large enough, rather than
>>>>>>>>>> the larger of the two ranges. In fact when switching to that model, we
>>>>>>>>>> ought to be able to get away with one less conditional, as then the
>>>>>>>>>> "kernel range fully contained" case doesn't need treating specially.
>>>>>>>> ... this alternative approach may want considering (provided we need
>>>>>>>> to make a change in the first place, which I continue to be
>>>>>>>> unconvinced of).
>>>>>>> Hmm, I see your point regarding the overflow.
>>>>>>> Given that start < kernel_end and end > kernel_start, this could
>>>>>>> be resolved by changing the above condition into:
>>>>>>> if ( kernel_end - start > end - kernel_start )
>>>>>>>
>>>>>>> Would that work for you?
>>>>>>
>>>>>> That would look quite a bit more natural, yes. But I don't think it covers
>>>>>> all cases: What if the E820 range is a proper sub-range of the kernel one?
>>>>>> If we consider kernel range crossing E820 region boundaries, we also need
>>>>>> to take that possibility into account, I think.
>>>>>
>>>>> You are right, this case is not handled and can lead to either of the
>>>>> issues mentioned in commit message.
>>>>> Maybe we should check whether end > start before proceeding with
>>>>> checking the size.
>>>>
>>>> It looks like it all boils down to the alternative I did sketch out.
>>>
>>> I 'm not sure I fully understood the alternative.
>>> Do you mean sth in the lines below?
>>>
>>>            if ( end <= kernel_start || start >= kernel_end )
>>>                ; /* No overlap, nothing to do. */
>>>            /* Deal with the kernel already being loaded in the region. */
>>> -        else if ( kernel_start - start > end - kernel_end )
>>> +        else if ( start < kernel_start && end > kernel_end ) {
>>> +            if ( kernel_start - start > end - kernel_end )
>>> +                end = kernel_start;
>>> +            else
>>> +                start = kernel_end;
>>> +        }
>>> +        else if ( start < kernel_start )
>>>                end = kernel_start;
>>> -        else
>>> +        else if ( end > kernel_end )
>>>                start = kernel_end;
>>> +        else
>>> +            continue;
>>>
>>>            if ( end - start >= size )
>>>                return start;
>>
>> Not exactly, no, because this still takes the size into account only
>> in this final if().
>>
>>> You wouldn't like to consider this approach?
>>
>> I'm happy to consider any other approach. Just that ...
>>
>>>            if ( end <= kernel_start || start >= kernel_end )
>>>                ; /* No overlap, nothing to do. */
>>>            /* Deal with the kernel already being loaded in the region. */
>>> -        else if ( kernel_start - start > end - kernel_end )
>>> +        else if ( kernel_end - start > end - kernel_start )
>>>                end = kernel_start;
>>>            else
>>>                start = kernel_end;
>>>
>>> -        if ( end - start >= size )
>>> +        if ( end > start && end - start >= size )
>>>                return start;
>>>        }
>>
>> ... I'm afraid this doesn't deal well with the specific case I was
>> mentioning: If the E820 region is fully contained in the kernel range,
>> it looks to me as if this approach would ignore the E820 altogether,
>> since you either move end ahead of start or start past end then. Both
>> head and tail regions may be large enough in this case, and if this
>> was the only region above 1M, there'd be no other space to fall back
>> to.
> 
> Yes, in which case it will fail. This is legitimate.

Not really, if there is space available (but just not properly used).

> Currently, the code proceeds with the dom0 kernel being corrupted.

And we agree that this wants fixing.

Jan
Xenia Ragiadakou Oct. 27, 2023, 11:18 a.m. UTC | #18
On 27/10/23 09:37, Jan Beulich wrote:
> On 26.10.2023 18:55, Xenia Ragiadakou wrote:
>>
>>
>> On 26/10/23 17:55, Jan Beulich wrote:
>>> On 26.10.2023 15:58, Xenia Ragiadakou wrote:
>>>>
>>>> On 26/10/23 15:37, Jan Beulich wrote:
>>>>> On 26.10.2023 14:35, Xenia Ragiadakou wrote:
>>>>>>
>>>>>>
>>>>>> On 26/10/23 14:51, Jan Beulich wrote:
>>>>>>> On 26.10.2023 11:46, Xenia Ragiadakou wrote:
>>>>>>>> On 26/10/23 11:45, Jan Beulich wrote:
>>>>>>>>> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
>>>>>>>>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>>>>>>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>>>>>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>>>>>>>>>>                if ( end <= kernel_start || start >= kernel_end )
>>>>>>>>>>>>                    ; /* No overlap, nothing to do. */
>>>>>>>>>>>>                /* Deal with the kernel already being loaded in the region. */
>>>>>>>>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>>>>>>>>> +        else if ( kernel_start + kernel_end > start + end )
>>>>>>>>>>> What meaning has the sum of the start and end of either range? I can't
>>>>>>>>>>> figure how comparing those two values will be generally correct / useful.
>>>>>>>>>>> If the partial-overlap case needs handling in the first place, I think
>>>>>>>>>>> new conditionals need adding (and the existing one needs constraining to
>>>>>>>>>>> "kernel range fully contained") to use
>>>>>>>>>>> - as before, the larger of the non-overlapping ranges at start and end
>>>>>>>>>>>         if the kernel range is fully contained,
>>>>>>>>>>> - the tail of the range when the overlap is at the start,
>>>>>>>>>>> - the head of the range when the overlap is at the end.
>>>>>>>>>> Yes it is not quite straight forward to understand and is based on the
>>>>>>>>>> assumption that end > kernel_start and start < kernel_end, due to
>>>>>>>>>> the first condition failing.
>>>>>>>>>>
>>>>>>>>>> Both cases:
>>>>>>>>>> (start < kernel_start && end < kernel_end) and
>>>>>>>>>> (kernel_start - start > end - kernel_end)
>>>>>>>>>> fall into the condition ( kernel_start + kernel_end > start + end )
>>>>>>>>>>
>>>>>>>>>> And both the cases:
>>>>>>>>>> (start > kernel_start && end > kernel_end) and
>>>>>>>>>> (end - kernel_end > kernel_start - start)
>>>>>>>>>> fall into the condition ( kernel_start + kernel_end < start + end )
>>>>>>>>>>
>>>>>>>>>> ... unless of course I miss a case
>>>>>>>>> Well, mathematically (i.e. ignoring the potential for overflow) the
>>>>>>>>> original expression and your replacement are identical anyway. But
>>>>>>>>> overflow needs to be taken into consideration, and hence there is a
>>>>>>>>> (theoretical only at this point) risk with the replacement expression
>>>>>>>>> as well. As a result I still think that ...
>>>>>>>>>
>>>>>>>>>>> That said, in the "kernel range fully contained" case it may want
>>>>>>>>>>> considering to use the tail range if it is large enough, rather than
>>>>>>>>>>> the larger of the two ranges. In fact when switching to that model, we
>>>>>>>>>>> ought to be able to get away with one less conditional, as then the
>>>>>>>>>>> "kernel range fully contained" case doesn't need treating specially.
>>>>>>>>> ... this alternative approach may want considering (provided we need
>>>>>>>>> to make a change in the first place, which I continue to be
>>>>>>>>> unconvinced of).
>>>>>>>> Hmm, I see your point regarding the overflow.
>>>>>>>> Given that start < kernel_end and end > kernel_start, this could
>>>>>>>> be resolved by changing the above condition into:
>>>>>>>> if ( kernel_end - start > end - kernel_start )
>>>>>>>>
>>>>>>>> Would that work for you?
>>>>>>>
>>>>>>> That would look quite a bit more natural, yes. But I don't think it covers
>>>>>>> all cases: What if the E820 range is a proper sub-range of the kernel one?
>>>>>>> If we consider kernel range crossing E820 region boundaries, we also need
>>>>>>> to take that possibility into account, I think.
>>>>>>
>>>>>> You are right, this case is not handled and can lead to either of the
>>>>>> issues mentioned in commit message.
>>>>>> Maybe we should check whether end > start before proceeding with
>>>>>> checking the size.
>>>>>
>>>>> It looks like it all boils down to the alternative I did sketch out.
>>>>
>>>> I 'm not sure I fully understood the alternative.
>>>> Do you mean sth in the lines below?
>>>>
>>>>             if ( end <= kernel_start || start >= kernel_end )
>>>>                 ; /* No overlap, nothing to do. */
>>>>             /* Deal with the kernel already being loaded in the region. */
>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>> +        else if ( start < kernel_start && end > kernel_end ) {
>>>> +            if ( kernel_start - start > end - kernel_end )
>>>> +                end = kernel_start;
>>>> +            else
>>>> +                start = kernel_end;
>>>> +        }
>>>> +        else if ( start < kernel_start )
>>>>                 end = kernel_start;
>>>> -        else
>>>> +        else if ( end > kernel_end )
>>>>                 start = kernel_end;
>>>> +        else
>>>> +            continue;
>>>>
>>>>             if ( end - start >= size )
>>>>                 return start;
>>>
>>> Not exactly, no, because this still takes the size into account only
>>> in this final if().
>>>
>>>> You wouldn't like to consider this approach?
>>>
>>> I'm happy to consider any other approach. Just that ...
>>>
>>>>             if ( end <= kernel_start || start >= kernel_end )
>>>>                 ; /* No overlap, nothing to do. */
>>>>             /* Deal with the kernel already being loaded in the region. */
>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>> +        else if ( kernel_end - start > end - kernel_start )
>>>>                 end = kernel_start;
>>>>             else
>>>>                 start = kernel_end;
>>>>
>>>> -        if ( end - start >= size )
>>>> +        if ( end > start && end - start >= size )
>>>>                 return start;
>>>>         }
>>>
>>> ... I'm afraid this doesn't deal well with the specific case I was
>>> mentioning: If the E820 region is fully contained in the kernel range,
>>> it looks to me as if this approach would ignore the E820 altogether,
>>> since you either move end ahead of start or start past end then. Both
>>> head and tail regions may be large enough in this case, and if this
>>> was the only region above 1M, there'd be no other space to fall back
>>> to.
>>
>> Yes, in which case it will fail. This is legitimate.
> 
> Not really, if there is space available (but just not properly used).

I said so because I noticed that, if, for instance, the loading address 
conflicts with a reserved memory region, xen won't attempt to relocate 
the kernel (assuming that it is relocatable). It will fail.

> 
>> Currently, the code proceeds with the dom0 kernel being corrupted.
> 
> And we agree that this wants fixing.

Ok, and IIUC, using rangeset as per Roger's suggestion, right?

> 
> Jan
Jan Beulich Oct. 27, 2023, 12:01 p.m. UTC | #19
On 27.10.2023 13:18, Xenia Ragiadakou wrote:
> 
> On 27/10/23 09:37, Jan Beulich wrote:
>> On 26.10.2023 18:55, Xenia Ragiadakou wrote:
>>>
>>>
>>> On 26/10/23 17:55, Jan Beulich wrote:
>>>> On 26.10.2023 15:58, Xenia Ragiadakou wrote:
>>>>>
>>>>> On 26/10/23 15:37, Jan Beulich wrote:
>>>>>> On 26.10.2023 14:35, Xenia Ragiadakou wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 26/10/23 14:51, Jan Beulich wrote:
>>>>>>>> On 26.10.2023 11:46, Xenia Ragiadakou wrote:
>>>>>>>>> On 26/10/23 11:45, Jan Beulich wrote:
>>>>>>>>>> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
>>>>>>>>>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>>>>>>>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>>>>>>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>>>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>>>>>>>>>>>                if ( end <= kernel_start || start >= kernel_end )
>>>>>>>>>>>>>                    ; /* No overlap, nothing to do. */
>>>>>>>>>>>>>                /* Deal with the kernel already being loaded in the region. */
>>>>>>>>>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>>>>>>>>>> +        else if ( kernel_start + kernel_end > start + end )
>>>>>>>>>>>> What meaning has the sum of the start and end of either range? I can't
>>>>>>>>>>>> figure how comparing those two values will be generally correct / useful.
>>>>>>>>>>>> If the partial-overlap case needs handling in the first place, I think
>>>>>>>>>>>> new conditionals need adding (and the existing one needs constraining to
>>>>>>>>>>>> "kernel range fully contained") to use
>>>>>>>>>>>> - as before, the larger of the non-overlapping ranges at start and end
>>>>>>>>>>>>         if the kernel range is fully contained,
>>>>>>>>>>>> - the tail of the range when the overlap is at the start,
>>>>>>>>>>>> - the head of the range when the overlap is at the end.
>>>>>>>>>>> Yes it is not quite straight forward to understand and is based on the
>>>>>>>>>>> assumption that end > kernel_start and start < kernel_end, due to
>>>>>>>>>>> the first condition failing.
>>>>>>>>>>>
>>>>>>>>>>> Both cases:
>>>>>>>>>>> (start < kernel_start && end < kernel_end) and
>>>>>>>>>>> (kernel_start - start > end - kernel_end)
>>>>>>>>>>> fall into the condition ( kernel_start + kernel_end > start + end )
>>>>>>>>>>>
>>>>>>>>>>> And both the cases:
>>>>>>>>>>> (start > kernel_start && end > kernel_end) and
>>>>>>>>>>> (end - kernel_end > kernel_start - start)
>>>>>>>>>>> fall into the condition ( kernel_start + kernel_end < start + end )
>>>>>>>>>>>
>>>>>>>>>>> ... unless of course I miss a case
>>>>>>>>>> Well, mathematically (i.e. ignoring the potential for overflow) the
>>>>>>>>>> original expression and your replacement are identical anyway. But
>>>>>>>>>> overflow needs to be taken into consideration, and hence there is a
>>>>>>>>>> (theoretical only at this point) risk with the replacement expression
>>>>>>>>>> as well. As a result I still think that ...
>>>>>>>>>>
>>>>>>>>>>>> That said, in the "kernel range fully contained" case it may want
>>>>>>>>>>>> considering to use the tail range if it is large enough, rather than
>>>>>>>>>>>> the larger of the two ranges. In fact when switching to that model, we
>>>>>>>>>>>> ought to be able to get away with one less conditional, as then the
>>>>>>>>>>>> "kernel range fully contained" case doesn't need treating specially.
>>>>>>>>>> ... this alternative approach may want considering (provided we need
>>>>>>>>>> to make a change in the first place, which I continue to be
>>>>>>>>>> unconvinced of).
>>>>>>>>> Hmm, I see your point regarding the overflow.
>>>>>>>>> Given that start < kernel_end and end > kernel_start, this could
>>>>>>>>> be resolved by changing the above condition into:
>>>>>>>>> if ( kernel_end - start > end - kernel_start )
>>>>>>>>>
>>>>>>>>> Would that work for you?
>>>>>>>>
>>>>>>>> That would look quite a bit more natural, yes. But I don't think it covers
>>>>>>>> all cases: What if the E820 range is a proper sub-range of the kernel one?
>>>>>>>> If we consider kernel range crossing E820 region boundaries, we also need
>>>>>>>> to take that possibility into account, I think.
>>>>>>>
>>>>>>> You are right, this case is not handled and can lead to either of the
>>>>>>> issues mentioned in commit message.
>>>>>>> Maybe we should check whether end > start before proceeding with
>>>>>>> checking the size.
>>>>>>
>>>>>> It looks like it all boils down to the alternative I did sketch out.
>>>>>
>>>>> I 'm not sure I fully understood the alternative.
>>>>> Do you mean sth in the lines below?
>>>>>
>>>>>             if ( end <= kernel_start || start >= kernel_end )
>>>>>                 ; /* No overlap, nothing to do. */
>>>>>             /* Deal with the kernel already being loaded in the region. */
>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>> +        else if ( start < kernel_start && end > kernel_end ) {
>>>>> +            if ( kernel_start - start > end - kernel_end )
>>>>> +                end = kernel_start;
>>>>> +            else
>>>>> +                start = kernel_end;
>>>>> +        }
>>>>> +        else if ( start < kernel_start )
>>>>>                 end = kernel_start;
>>>>> -        else
>>>>> +        else if ( end > kernel_end )
>>>>>                 start = kernel_end;
>>>>> +        else
>>>>> +            continue;
>>>>>
>>>>>             if ( end - start >= size )
>>>>>                 return start;
>>>>
>>>> Not exactly, no, because this still takes the size into account only
>>>> in this final if().
>>>>
>>>>> You wouldn't like to consider this approach?
>>>>
>>>> I'm happy to consider any other approach. Just that ...
>>>>
>>>>>             if ( end <= kernel_start || start >= kernel_end )
>>>>>                 ; /* No overlap, nothing to do. */
>>>>>             /* Deal with the kernel already being loaded in the region. */
>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>> +        else if ( kernel_end - start > end - kernel_start )
>>>>>                 end = kernel_start;
>>>>>             else
>>>>>                 start = kernel_end;
>>>>>
>>>>> -        if ( end - start >= size )
>>>>> +        if ( end > start && end - start >= size )
>>>>>                 return start;
>>>>>         }
>>>>
>>>> ... I'm afraid this doesn't deal well with the specific case I was
>>>> mentioning: If the E820 region is fully contained in the kernel range,
>>>> it looks to me as if this approach would ignore the E820 altogether,
>>>> since you either move end ahead of start or start past end then. Both
>>>> head and tail regions may be large enough in this case, and if this
>>>> was the only region above 1M, there'd be no other space to fall back
>>>> to.
>>>
>>> Yes, in which case it will fail. This is legitimate.
>>
>> Not really, if there is space available (but just not properly used).
> 
> I said so because I noticed that, if, for instance, the loading address 
> conflicts with a reserved memory region, xen won't attempt to relocate 
> the kernel (assuming that it is relocatable). It will fail.

Hmm, if so, perhaps yet something else to deal with.

>>> Currently, the code proceeds with the dom0 kernel being corrupted.
>>
>> And we agree that this wants fixing.
> 
> Ok, and IIUC, using rangeset as per Roger's suggestion, right?

Going that route would be optimal of course, but I for one wouldn't
insist.

Jan
Xenia Ragiadakou Oct. 30, 2023, 7:37 a.m. UTC | #20
On 27/10/23 15:01, Jan Beulich wrote:
> On 27.10.2023 13:18, Xenia Ragiadakou wrote:
>>
>> On 27/10/23 09:37, Jan Beulich wrote:
>>> On 26.10.2023 18:55, Xenia Ragiadakou wrote:
>>>>
>>>>
>>>> On 26/10/23 17:55, Jan Beulich wrote:
>>>>> On 26.10.2023 15:58, Xenia Ragiadakou wrote:
>>>>>>
>>>>>> On 26/10/23 15:37, Jan Beulich wrote:
>>>>>>> On 26.10.2023 14:35, Xenia Ragiadakou wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 26/10/23 14:51, Jan Beulich wrote:
>>>>>>>>> On 26.10.2023 11:46, Xenia Ragiadakou wrote:
>>>>>>>>>> On 26/10/23 11:45, Jan Beulich wrote:
>>>>>>>>>>> On 26.10.2023 10:34, Xenia Ragiadakou wrote:
>>>>>>>>>>>> On 26/10/23 10:35, Jan Beulich wrote:
>>>>>>>>>>>>> On 26.10.2023 08:45, Xenia Ragiadakou wrote:
>>>>>>>>>>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>>>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>>>>>>>>>>> @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
>>>>>>>>>>>>>>                 if ( end <= kernel_start || start >= kernel_end )
>>>>>>>>>>>>>>                     ; /* No overlap, nothing to do. */
>>>>>>>>>>>>>>                 /* Deal with the kernel already being loaded in the region. */
>>>>>>>>>>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>>>>>>>>>>> +        else if ( kernel_start + kernel_end > start + end )
>>>>>>>>>>>>> What meaning has the sum of the start and end of either range? I can't
>>>>>>>>>>>>> figure how comparing those two values will be generally correct / useful.
>>>>>>>>>>>>> If the partial-overlap case needs handling in the first place, I think
>>>>>>>>>>>>> new conditionals need adding (and the existing one needs constraining to
>>>>>>>>>>>>> "kernel range fully contained") to use
>>>>>>>>>>>>> - as before, the larger of the non-overlapping ranges at start and end
>>>>>>>>>>>>>          if the kernel range is fully contained,
>>>>>>>>>>>>> - the tail of the range when the overlap is at the start,
>>>>>>>>>>>>> - the head of the range when the overlap is at the end.
>>>>>>>>>>>> Yes it is not quite straight forward to understand and is based on the
>>>>>>>>>>>> assumption that end > kernel_start and start < kernel_end, due to
>>>>>>>>>>>> the first condition failing.
>>>>>>>>>>>>
>>>>>>>>>>>> Both cases:
>>>>>>>>>>>> (start < kernel_start && end < kernel_end) and
>>>>>>>>>>>> (kernel_start - start > end - kernel_end)
>>>>>>>>>>>> fall into the condition ( kernel_start + kernel_end > start + end )
>>>>>>>>>>>>
>>>>>>>>>>>> And both the cases:
>>>>>>>>>>>> (start > kernel_start && end > kernel_end) and
>>>>>>>>>>>> (end - kernel_end > kernel_start - start)
>>>>>>>>>>>> fall into the condition ( kernel_start + kernel_end < start + end )
>>>>>>>>>>>>
>>>>>>>>>>>> ... unless of course I miss a case
>>>>>>>>>>> Well, mathematically (i.e. ignoring the potential for overflow) the
>>>>>>>>>>> original expression and your replacement are identical anyway. But
>>>>>>>>>>> overflow needs to be taken into consideration, and hence there is a
>>>>>>>>>>> (theoretical only at this point) risk with the replacement expression
>>>>>>>>>>> as well. As a result I still think that ...
>>>>>>>>>>>
>>>>>>>>>>>>> That said, in the "kernel range fully contained" case it may want
>>>>>>>>>>>>> considering to use the tail range if it is large enough, rather than
>>>>>>>>>>>>> the larger of the two ranges. In fact when switching to that model, we
>>>>>>>>>>>>> ought to be able to get away with one less conditional, as then the
>>>>>>>>>>>>> "kernel range fully contained" case doesn't need treating specially.
>>>>>>>>>>> ... this alternative approach may want considering (provided we need
>>>>>>>>>>> to make a change in the first place, which I continue to be
>>>>>>>>>>> unconvinced of).
>>>>>>>>>> Hmm, I see your point regarding the overflow.
>>>>>>>>>> Given that start < kernel_end and end > kernel_start, this could
>>>>>>>>>> be resolved by changing the above condition into:
>>>>>>>>>> if ( kernel_end - start > end - kernel_start )
>>>>>>>>>>
>>>>>>>>>> Would that work for you?
>>>>>>>>>
>>>>>>>>> That would look quite a bit more natural, yes. But I don't think it covers
>>>>>>>>> all cases: What if the E820 range is a proper sub-range of the kernel one?
>>>>>>>>> If we consider kernel range crossing E820 region boundaries, we also need
>>>>>>>>> to take that possibility into account, I think.
>>>>>>>>
>>>>>>>> You are right, this case is not handled and can lead to either of the
>>>>>>>> issues mentioned in commit message.
>>>>>>>> Maybe we should check whether end > start before proceeding with
>>>>>>>> checking the size.
>>>>>>>
>>>>>>> It looks like it all boils down to the alternative I did sketch out.
>>>>>>
>>>>>> I 'm not sure I fully understood the alternative.
>>>>>> Do you mean sth in the lines below?
>>>>>>
>>>>>>              if ( end <= kernel_start || start >= kernel_end )
>>>>>>                  ; /* No overlap, nothing to do. */
>>>>>>              /* Deal with the kernel already being loaded in the region. */
>>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>>> +        else if ( start < kernel_start && end > kernel_end ) {
>>>>>> +            if ( kernel_start - start > end - kernel_end )
>>>>>> +                end = kernel_start;
>>>>>> +            else
>>>>>> +                start = kernel_end;
>>>>>> +        }
>>>>>> +        else if ( start < kernel_start )
>>>>>>                  end = kernel_start;
>>>>>> -        else
>>>>>> +        else if ( end > kernel_end )
>>>>>>                  start = kernel_end;
>>>>>> +        else
>>>>>> +            continue;
>>>>>>
>>>>>>              if ( end - start >= size )
>>>>>>                  return start;
>>>>>
>>>>> Not exactly, no, because this still takes the size into account only
>>>>> in this final if().
>>>>>
>>>>>> You wouldn't like to consider this approach?
>>>>>
>>>>> I'm happy to consider any other approach. Just that ...
>>>>>
>>>>>>              if ( end <= kernel_start || start >= kernel_end )
>>>>>>                  ; /* No overlap, nothing to do. */
>>>>>>              /* Deal with the kernel already being loaded in the region. */
>>>>>> -        else if ( kernel_start - start > end - kernel_end )
>>>>>> +        else if ( kernel_end - start > end - kernel_start )
>>>>>>                  end = kernel_start;
>>>>>>              else
>>>>>>                  start = kernel_end;
>>>>>>
>>>>>> -        if ( end - start >= size )
>>>>>> +        if ( end > start && end - start >= size )
>>>>>>                  return start;
>>>>>>          }
>>>>>
>>>>> ... I'm afraid this doesn't deal well with the specific case I was
>>>>> mentioning: If the E820 region is fully contained in the kernel range,
>>>>> it looks to me as if this approach would ignore the E820 altogether,
>>>>> since you either move end ahead of start or start past end then. Both
>>>>> head and tail regions may be large enough in this case, and if this
>>>>> was the only region above 1M, there'd be no other space to fall back
>>>>> to.
>>>>
>>>> Yes, in which case it will fail. This is legitimate.
>>>
>>> Not really, if there is space available (but just not properly used).
>>
>> I said so because I noticed that, if, for instance, the loading address
>> conflicts with a reserved memory region, xen won't attempt to relocate
>> the kernel (assuming that it is relocatable). It will fail.
> 
> Hmm, if so, perhaps yet something else to deal with.
> 
>>>> Currently, the code proceeds with the dom0 kernel being corrupted.
>>>
>>> And we agree that this wants fixing.
>>
>> Ok, and IIUC, using rangeset as per Roger's suggestion, right?
> 
> Going that route would be optimal of course, but I for one wouldn't
> insist.

Roger what's your opinion on this? Would it be ok for now to send a v2 
with just the change above (i.e. that might fail if the image extends 
over a really large memory area containing all usable memory regions) 
and implement later your suggested solution (probably taking into 
account in the implementation also image relocation aspect in case of 
conflicts)?

Jan would it be possible to sketch a patch of your suggested solution 
because I 'm afraid I have not fully understood it yet and I won't be 
able to implement it properly for a v2?
Roger Pau Monné Oct. 30, 2023, 11:18 a.m. UTC | #21
On Mon, Oct 30, 2023 at 09:37:24AM +0200, Xenia Ragiadakou wrote:
> 
> On 27/10/23 15:01, Jan Beulich wrote:
> > On 27.10.2023 13:18, Xenia Ragiadakou wrote:
> > > 
> > > On 27/10/23 09:37, Jan Beulich wrote:
> > > > On 26.10.2023 18:55, Xenia Ragiadakou wrote:
> > > > > 
> > > > > 
> > > > > On 26/10/23 17:55, Jan Beulich wrote:
> > > > > > On 26.10.2023 15:58, Xenia Ragiadakou wrote:
> > > > > > > 
> > > > > > > On 26/10/23 15:37, Jan Beulich wrote:
> > > > > > > > On 26.10.2023 14:35, Xenia Ragiadakou wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > On 26/10/23 14:51, Jan Beulich wrote:
> > > > > > > > > > On 26.10.2023 11:46, Xenia Ragiadakou wrote:
> > > > > > > > > > > On 26/10/23 11:45, Jan Beulich wrote:
> > > > > > > > > > > > On 26.10.2023 10:34, Xenia Ragiadakou wrote:
> > > > > > > > > > > > > On 26/10/23 10:35, Jan Beulich wrote:
> > > > > > > > > > > > > > On 26.10.2023 08:45, Xenia Ragiadakou wrote:
> > > > > > > > > > > > > > > --- a/xen/arch/x86/hvm/dom0_build.c
> > > > > > > > > > > > > > > +++ b/xen/arch/x86/hvm/dom0_build.c
> > > > > > > > > > > > > > > @@ -518,7 +518,7 @@ static paddr_t __init find_memory(
> > > > > > > > > > > > > > >                 if ( end <= kernel_start || start >= kernel_end )
> > > > > > > > > > > > > > >                     ; /* No overlap, nothing to do. */
> > > > > > > > > > > > > > >                 /* Deal with the kernel already being loaded in the region. */
> > > > > > > > > > > > > > > -        else if ( kernel_start - start > end - kernel_end )
> > > > > > > > > > > > > > > +        else if ( kernel_start + kernel_end > start + end )
> > > > > > > > > > > > > > What meaning has the sum of the start and end of either range? I can't
> > > > > > > > > > > > > > figure how comparing those two values will be generally correct / useful.
> > > > > > > > > > > > > > If the partial-overlap case needs handling in the first place, I think
> > > > > > > > > > > > > > new conditionals need adding (and the existing one needs constraining to
> > > > > > > > > > > > > > "kernel range fully contained") to use
> > > > > > > > > > > > > > - as before, the larger of the non-overlapping ranges at start and end
> > > > > > > > > > > > > >          if the kernel range is fully contained,
> > > > > > > > > > > > > > - the tail of the range when the overlap is at the start,
> > > > > > > > > > > > > > - the head of the range when the overlap is at the end.
> > > > > > > > > > > > > Yes it is not quite straight forward to understand and is based on the
> > > > > > > > > > > > > assumption that end > kernel_start and start < kernel_end, due to
> > > > > > > > > > > > > the first condition failing.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Both cases:
> > > > > > > > > > > > > (start < kernel_start && end < kernel_end) and
> > > > > > > > > > > > > (kernel_start - start > end - kernel_end)
> > > > > > > > > > > > > fall into the condition ( kernel_start + kernel_end > start + end )
> > > > > > > > > > > > > 
> > > > > > > > > > > > > And both the cases:
> > > > > > > > > > > > > (start > kernel_start && end > kernel_end) and
> > > > > > > > > > > > > (end - kernel_end > kernel_start - start)
> > > > > > > > > > > > > fall into the condition ( kernel_start + kernel_end < start + end )
> > > > > > > > > > > > > 
> > > > > > > > > > > > > ... unless of course I miss a case
> > > > > > > > > > > > Well, mathematically (i.e. ignoring the potential for overflow) the
> > > > > > > > > > > > original expression and your replacement are identical anyway. But
> > > > > > > > > > > > overflow needs to be taken into consideration, and hence there is a
> > > > > > > > > > > > (theoretical only at this point) risk with the replacement expression
> > > > > > > > > > > > as well. As a result I still think that ...
> > > > > > > > > > > > 
> > > > > > > > > > > > > > That said, in the "kernel range fully contained" case it may want
> > > > > > > > > > > > > > considering to use the tail range if it is large enough, rather than
> > > > > > > > > > > > > > the larger of the two ranges. In fact when switching to that model, we
> > > > > > > > > > > > > > ought to be able to get away with one less conditional, as then the
> > > > > > > > > > > > > > "kernel range fully contained" case doesn't need treating specially.
> > > > > > > > > > > > ... this alternative approach may want considering (provided we need
> > > > > > > > > > > > to make a change in the first place, which I continue to be
> > > > > > > > > > > > unconvinced of).
> > > > > > > > > > > Hmm, I see your point regarding the overflow.
> > > > > > > > > > > Given that start < kernel_end and end > kernel_start, this could
> > > > > > > > > > > be resolved by changing the above condition into:
> > > > > > > > > > > if ( kernel_end - start > end - kernel_start )
> > > > > > > > > > > 
> > > > > > > > > > > Would that work for you?
> > > > > > > > > > 
> > > > > > > > > > That would look quite a bit more natural, yes. But I don't think it covers
> > > > > > > > > > all cases: What if the E820 range is a proper sub-range of the kernel one?
> > > > > > > > > > If we consider kernel range crossing E820 region boundaries, we also need
> > > > > > > > > > to take that possibility into account, I think.
> > > > > > > > > 
> > > > > > > > > You are right, this case is not handled and can lead to either of the
> > > > > > > > > issues mentioned in commit message.
> > > > > > > > > Maybe we should check whether end > start before proceeding with
> > > > > > > > > checking the size.
> > > > > > > > 
> > > > > > > > It looks like it all boils down to the alternative I did sketch out.
> > > > > > > 
> > > > > > > I 'm not sure I fully understood the alternative.
> > > > > > > Do you mean sth in the lines below?
> > > > > > > 
> > > > > > >              if ( end <= kernel_start || start >= kernel_end )
> > > > > > >                  ; /* No overlap, nothing to do. */
> > > > > > >              /* Deal with the kernel already being loaded in the region. */
> > > > > > > -        else if ( kernel_start - start > end - kernel_end )
> > > > > > > +        else if ( start < kernel_start && end > kernel_end ) {
> > > > > > > +            if ( kernel_start - start > end - kernel_end )
> > > > > > > +                end = kernel_start;
> > > > > > > +            else
> > > > > > > +                start = kernel_end;
> > > > > > > +        }
> > > > > > > +        else if ( start < kernel_start )
> > > > > > >                  end = kernel_start;
> > > > > > > -        else
> > > > > > > +        else if ( end > kernel_end )
> > > > > > >                  start = kernel_end;
> > > > > > > +        else
> > > > > > > +            continue;
> > > > > > > 
> > > > > > >              if ( end - start >= size )
> > > > > > >                  return start;
> > > > > > 
> > > > > > Not exactly, no, because this still takes the size into account only
> > > > > > in this final if().
> > > > > > 
> > > > > > > You wouldn't like to consider this approach?
> > > > > > 
> > > > > > I'm happy to consider any other approach. Just that ...
> > > > > > 
> > > > > > >              if ( end <= kernel_start || start >= kernel_end )
> > > > > > >                  ; /* No overlap, nothing to do. */
> > > > > > >              /* Deal with the kernel already being loaded in the region. */
> > > > > > > -        else if ( kernel_start - start > end - kernel_end )
> > > > > > > +        else if ( kernel_end - start > end - kernel_start )
> > > > > > >                  end = kernel_start;
> > > > > > >              else
> > > > > > >                  start = kernel_end;
> > > > > > > 
> > > > > > > -        if ( end - start >= size )
> > > > > > > +        if ( end > start && end - start >= size )
> > > > > > >                  return start;
> > > > > > >          }
> > > > > > 
> > > > > > ... I'm afraid this doesn't deal well with the specific case I was
> > > > > > mentioning: If the E820 region is fully contained in the kernel range,
> > > > > > it looks to me as if this approach would ignore the E820 altogether,
> > > > > > since you either move end ahead of start or start past end then. Both
> > > > > > head and tail regions may be large enough in this case, and if this
> > > > > > was the only region above 1M, there'd be no other space to fall back
> > > > > > to.
> > > > > 
> > > > > Yes, in which case it will fail. This is legitimate.
> > > > 
> > > > Not really, if there is space available (but just not properly used).
> > > 
> > > I said so because I noticed that, if, for instance, the loading address
> > > conflicts with a reserved memory region, xen won't attempt to relocate
> > > the kernel (assuming that it is relocatable). It will fail.
> > 
> > Hmm, if so, perhaps yet something else to deal with.
> > 
> > > > > Currently, the code proceeds with the dom0 kernel being corrupted.
> > > > 
> > > > And we agree that this wants fixing.
> > > 
> > > Ok, and IIUC, using rangeset as per Roger's suggestion, right?
> > 
> > Going that route would be optimal of course, but I for one wouldn't
> > insist.
> 
> Roger what's your opinion on this? Would it be ok for now to send a v2 with
> just the change above (i.e. that might fail if the image extends over a
> really large memory area containing all usable memory regions) and implement
> later your suggested solution (probably taking into account in the
> implementation also image relocation aspect in case of conflicts)?
> 
> Jan would it be possible to sketch a patch of your suggested solution
> because I 'm afraid I have not fully understood it yet and I won't be able
> to implement it properly for a v2?

I think something like the fully untested patch below:

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index c7d47d0d4ce6..64ec8835d53e 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -517,8 +517,12 @@ static paddr_t __init find_memory(

         if ( end <= kernel_start || start >= kernel_end )
             ; /* No overlap, nothing to do. */
+        else if ( kernel_start <= start && kernel_end >= end )
+            /* Kernel is using the full region, skip it. */
+            continue;
         /* Deal with the kernel already being loaded in the region. */
-        else if ( kernel_start - start > end - kernel_end )
+        else if ( max(kernel_start, start) - start >
+                  end - min(kernel_end, end) )
             end = kernel_start;
         else
             start = kernel_end;

Should be enough to solve the issue with a kernel consuming several
e820 RAM regions entirely?

Could you also add a note to the commit message that the current code
assumes that all kernel loaded segments are mostly contiguous and
loaded in a single RAM region, but that's not suitable for Zephyr that
has segments scattered across possibly several different RAM regions.

A TODO in the code describing that it would be best to use a rangeset
for more fine grained accounting of memory used by the loaded kernel
segments would also be helpful IMO in order to remember that we
might want to fix this in the future.

Thanks, Roger.
Jan Beulich Oct. 30, 2023, 12:12 p.m. UTC | #22
On 30.10.2023 08:37, Xenia Ragiadakou wrote:
> Jan would it be possible to sketch a patch of your suggested solution 
> because I 'm afraid I have not fully understood it yet and I won't be 
> able to implement it properly for a v2?

While what Roger sent looks to be sufficient, I still thought I'd send
my variant, which I think yields more overall consistent results. Like
Roger's this isn't really tested (beyond making sure it builds).

Jan

From: Xenia Ragiadakou <xenia.ragiadakou@amd.com>
Subject: x86/hvm/dom0: fix PVH initrd and metadata placement

Given that start < kernel_end and end > kernel_start, the logic that
determines the best placement for dom0 initrd and metadata, does not
take into account the two cases below:
(1) start > kernel_start && end > kernel_end
(2) start < kernel_start && end < kernel_end

In case (1), the evaluation will result in end = kernel_start
i.e. end < start, and will load initrd in the middle of the kernel.
In case (2), the evaluation will result in start = kernel_end
i.e. end < start, and will load initrd at kernel_end, that is out
of the memory region under evaluation.

This patch reorganizes the conditionals to include so far unconsidered
cases as well, uniformly returning the lowest available address.

Fixes: 73b47eea2104 ('x86/dom0: improve PVH initrd and metadata placement')
Signed-off-by: Xenia Ragiadakou <xenia.ragiadakou@amd.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Contrary to my original intentions, with the function preferring lower
addresses (by walking the E820 table forwards), the new cases also
return lowest-possible addresses.
---
v2: Cover further cases of overlap.

--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -515,16 +515,23 @@ static paddr_t __init find_memory(
 
         ASSERT(IS_ALIGNED(start, PAGE_SIZE) && IS_ALIGNED(end, PAGE_SIZE));
 
+        /*
+         * NB: Even better would be to use rangesets to determine a suitable
+         * range, in particular in case a kernel requests multiple heavily
+         * discontiguous regions (which right now we fold all into one big
+         * region).
+         */
         if ( end <= kernel_start || start >= kernel_end )
-            ; /* No overlap, nothing to do. */
+        {
+            /* No overlap, just check whether the region is large enough. */
+            if ( end - start >= size )
+                return start;
+        }
         /* Deal with the kernel already being loaded in the region. */
-        else if ( kernel_start - start > end - kernel_end )
-            end = kernel_start;
-        else
-            start = kernel_end;
-
-        if ( end - start >= size )
+        else if ( kernel_start > start && kernel_start - start >= size )
             return start;
+        else if ( kernel_end < end && end - kernel_end >= size )
+            return kernel_end;
     }
 
     return INVALID_PADDR;
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index c7d47d0d4c..5fc2c12f3a 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -518,7 +518,7 @@  static paddr_t __init find_memory(
         if ( end <= kernel_start || start >= kernel_end )
             ; /* No overlap, nothing to do. */
         /* Deal with the kernel already being loaded in the region. */
-        else if ( kernel_start - start > end - kernel_end )
+        else if ( kernel_start + kernel_end > start + end )
             end = kernel_start;
         else
             start = kernel_end;