Message ID | 20200303115253.47449-1-roger.pau@citrix.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [v2] x86/dom0: improve PVH initrd and metadata placement | expand |
On 03.03.2020 12:52, Roger Pau Monne wrote: > --- a/xen/arch/x86/hvm/dom0_build.c > +++ b/xen/arch/x86/hvm/dom0_build.c > @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d) > #undef MB1_PAGES > } > > +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf, > + size_t size) > +{ > + paddr_t kernel_start = (paddr_t)elf->dest_base; > + paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size); > + unsigned int i; > + > + for ( i = 0; i < d->arch.nr_e820; i++ ) > + { > + paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size; > + > + /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */ > + if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM ) > + continue; > + > + start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1)); > + > + if ( end <= kernel_start || start >= kernel_end ) > + ; /* No overlap, nothing to do. */ > + /* Deal with the kernel already being loaded in the region. */ > + else if ( kernel_start <= start && kernel_end > start ) Since, according to your reply on v1, [kernel_start,kernel_end) is a subset of [start,end), I understand that the <= could equally well be == - do you agree? From this then ... > + /* Truncate the start of the region. */ > + start = ROUNDUP(kernel_end, PAGE_SIZE); > + else if ( kernel_start <= end && kernel_end > end ) ... it follows that you now have two off-by-1s here, as you changed the right side of the && instead of the left one (the right side could, as per above, use == again). Using == in both places would, in lieu of a comment, imo make more visible to the reader that there is this sub-range relationship between both ranges. > + /* Truncate the end of the region. */ > + end = kernel_start; > + /* Pick the biggest of the split regions. */ Then again - wouldn't this part suffice? if start == kernel_start or end == kernel_end, one side of the "split" region would simply be empty. > + else if ( kernel_start - start > end - kernel_end ) > + end = kernel_start; > + else > + start = ROUNDUP(kernel_end, PAGE_SIZE); > + > + if ( end - start >= size ) > + return start; > + } > + > + return INVALID_PADDR; > +} > + > static int __init pvh_load_kernel(struct domain *d, const module_t *image, > unsigned long image_headroom, > module_t *initrd, void *image_base, > @@ -546,7 +585,24 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image, > return rc; > } > > - last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE); > + /* > + * Find a RAM region big enough (and that doesn't overlap with the loaded > + * kernel) in order to load the initrd and the metadata. Note it could be > + * split into smaller allocations, done it as a single region in order to > + * simplify it. I guess either "done" without "it" or "doing it"? Jan
On 03/03/2020 11:52, Roger Pau Monne wrote: > Don't assume there's going to be enough space at the tail of the > loaded kernel and instead try to find a suitable memory area where the > initrd and metadata can be loaded. > > Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> I can confirm that this fixes the "failed to boot PVH" on my Rome system. Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> We've still got the excessive-time-to-construct issues to look at, but this definitely brings things to a better position. > diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c > index eded87eaf5..33520ec1bc 100644 > --- a/xen/arch/x86/hvm/dom0_build.c > +++ b/xen/arch/x86/hvm/dom0_build.c > @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d) > #undef MB1_PAGES > } > > +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf, > + size_t size) > +{ > + paddr_t kernel_start = (paddr_t)elf->dest_base; > + paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size); > + unsigned int i; > + > + for ( i = 0; i < d->arch.nr_e820; i++ ) > + { > + paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size; > + > + /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */ The BDA is in mfn 0 so is special for other reasons*. The EBDA and IBFT are the problem datastructures. ~Andrew [*] Thinking about it, how should a PVH hardware domain reconcile its paravirtualised boot with finding itself on a BIOS or EFI system...
On Tue, Mar 03, 2020 at 04:40:36PM +0100, Jan Beulich wrote: > On 03.03.2020 12:52, Roger Pau Monne wrote: > > --- a/xen/arch/x86/hvm/dom0_build.c > > +++ b/xen/arch/x86/hvm/dom0_build.c > > @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d) > > #undef MB1_PAGES > > } > > > > +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf, > > + size_t size) > > +{ > > + paddr_t kernel_start = (paddr_t)elf->dest_base; > > + paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size); > > + unsigned int i; > > + > > + for ( i = 0; i < d->arch.nr_e820; i++ ) > > + { > > + paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size; > > + > > + /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */ > > + if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM ) > > + continue; > > + > > + start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1)); > > + > > + if ( end <= kernel_start || start >= kernel_end ) > > + ; /* No overlap, nothing to do. */ > > + /* Deal with the kernel already being loaded in the region. */ > > + else if ( kernel_start <= start && kernel_end > start ) > > Since, according to your reply on v1, [kernel_start,kernel_end) is > a subset of [start,end), I understand that the <= could equally > well be == - do you agree? From this then ... > > > + /* Truncate the start of the region. */ > > + start = ROUNDUP(kernel_end, PAGE_SIZE); > > + else if ( kernel_start <= end && kernel_end > end ) > > ... it follows that you now have two off-by-1s here, as you changed > the right side of the && instead of the left one (the right side > could, as per above, use == again). Using == in both places would, > in lieu of a comment, imo make more visible to the reader that > there is this sub-range relationship between both ranges. Right, I agree to both the above and have adjusted the conditions. > > + /* Truncate the end of the region. */ > > + end = kernel_start; > > + /* Pick the biggest of the split regions. */ > > Then again - wouldn't this part suffice? if start == kernel_start > or end == kernel_end, one side of the "split" region would simply > be empty. That's why it's using an else if construct, so that we only get here if the kernel is loaded in the middle of the region, and there are two regions left as part of the split. > > > + else if ( kernel_start - start > end - kernel_end ) > > + end = kernel_start; > > + else > > + start = ROUNDUP(kernel_end, PAGE_SIZE); > > + > > + if ( end - start >= size ) > > + return start; > > + } > > + > > + return INVALID_PADDR; > > +} > > + > > static int __init pvh_load_kernel(struct domain *d, const module_t *image, > > unsigned long image_headroom, > > module_t *initrd, void *image_base, > > @@ -546,7 +585,24 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image, > > return rc; > > } > > > > - last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE); > > + /* > > + * Find a RAM region big enough (and that doesn't overlap with the loaded > > + * kernel) in order to load the initrd and the metadata. Note it could be > > + * split into smaller allocations, done it as a single region in order to > > + * simplify it. > > I guess either "done" without "it" or "doing it"? Fixed, thanks. Roger.
On 04.03.2020 10:53, Roger Pau Monné wrote: > On Tue, Mar 03, 2020 at 04:40:36PM +0100, Jan Beulich wrote: >> On 03.03.2020 12:52, Roger Pau Monne wrote: >>> --- a/xen/arch/x86/hvm/dom0_build.c >>> +++ b/xen/arch/x86/hvm/dom0_build.c >>> @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d) >>> #undef MB1_PAGES >>> } >>> >>> +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf, >>> + size_t size) >>> +{ >>> + paddr_t kernel_start = (paddr_t)elf->dest_base; >>> + paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size); >>> + unsigned int i; >>> + >>> + for ( i = 0; i < d->arch.nr_e820; i++ ) >>> + { >>> + paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size; >>> + >>> + /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */ >>> + if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM ) >>> + continue; >>> + >>> + start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1)); >>> + >>> + if ( end <= kernel_start || start >= kernel_end ) >>> + ; /* No overlap, nothing to do. */ >>> + /* Deal with the kernel already being loaded in the region. */ >>> + else if ( kernel_start <= start && kernel_end > start ) >> >> Since, according to your reply on v1, [kernel_start,kernel_end) is >> a subset of [start,end), I understand that the <= could equally >> well be == - do you agree? From this then ... >> >>> + /* Truncate the start of the region. */ >>> + start = ROUNDUP(kernel_end, PAGE_SIZE); >>> + else if ( kernel_start <= end && kernel_end > end ) >> >> ... it follows that you now have two off-by-1s here, as you changed >> the right side of the && instead of the left one (the right side >> could, as per above, use == again). Using == in both places would, >> in lieu of a comment, imo make more visible to the reader that >> there is this sub-range relationship between both ranges. > > Right, I agree to both the above and have adjusted the conditions. > >>> + /* Truncate the end of the region. */ >>> + end = kernel_start; >>> + /* Pick the biggest of the split regions. */ >> >> Then again - wouldn't this part suffice? if start == kernel_start >> or end == kernel_end, one side of the "split" region would simply >> be empty. > > That's why it's using an else if construct, so that we only get > here if the kernel is loaded in the middle of the region, and there > are two regions left as part of the split. But my question is - do we really need the earlier parts of this if/else-if chain? Won't this latter part deal find with zero-sized ranges at head or tail of the region? Jan
On Wed, Mar 04, 2020 at 11:00:18AM +0100, Jan Beulich wrote: > On 04.03.2020 10:53, Roger Pau Monné wrote: > > On Tue, Mar 03, 2020 at 04:40:36PM +0100, Jan Beulich wrote: > >> On 03.03.2020 12:52, Roger Pau Monne wrote: > >>> --- a/xen/arch/x86/hvm/dom0_build.c > >>> +++ b/xen/arch/x86/hvm/dom0_build.c > >>> @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d) > >>> #undef MB1_PAGES > >>> } > >>> > >>> +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf, > >>> + size_t size) > >>> +{ > >>> + paddr_t kernel_start = (paddr_t)elf->dest_base; > >>> + paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size); > >>> + unsigned int i; > >>> + > >>> + for ( i = 0; i < d->arch.nr_e820; i++ ) > >>> + { > >>> + paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size; > >>> + > >>> + /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */ > >>> + if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM ) > >>> + continue; > >>> + > >>> + start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1)); > >>> + > >>> + if ( end <= kernel_start || start >= kernel_end ) > >>> + ; /* No overlap, nothing to do. */ > >>> + /* Deal with the kernel already being loaded in the region. */ > >>> + else if ( kernel_start <= start && kernel_end > start ) > >> > >> Since, according to your reply on v1, [kernel_start,kernel_end) is > >> a subset of [start,end), I understand that the <= could equally > >> well be == - do you agree? From this then ... > >> > >>> + /* Truncate the start of the region. */ > >>> + start = ROUNDUP(kernel_end, PAGE_SIZE); > >>> + else if ( kernel_start <= end && kernel_end > end ) > >> > >> ... it follows that you now have two off-by-1s here, as you changed > >> the right side of the && instead of the left one (the right side > >> could, as per above, use == again). Using == in both places would, > >> in lieu of a comment, imo make more visible to the reader that > >> there is this sub-range relationship between both ranges. > > > > Right, I agree to both the above and have adjusted the conditions. > > > >>> + /* Truncate the end of the region. */ > >>> + end = kernel_start; > >>> + /* Pick the biggest of the split regions. */ > >> > >> Then again - wouldn't this part suffice? if start == kernel_start > >> or end == kernel_end, one side of the "split" region would simply > >> be empty. > > > > That's why it's using an else if construct, so that we only get > > here if the kernel is loaded in the middle of the region, and there > > are two regions left as part of the split. > > But my question is - do we really need the earlier parts of > this if/else-if chain? Won't this latter part deal find with > zero-sized ranges at head or tail of the region? Oh, I misread your reply sorry. Yes you are right, I can achieve the same just with this last part. Thanks, Roger.
On Tue, Mar 03, 2020 at 06:47:35PM +0000, Andrew Cooper wrote: > On 03/03/2020 11:52, Roger Pau Monne wrote: > > Don't assume there's going to be enough space at the tail of the > > loaded kernel and instead try to find a suitable memory area where the > > initrd and metadata can be loaded. > > > > Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> > > I can confirm that this fixes the "failed to boot PVH" on my Rome system. > > Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Thanks! > We've still got the excessive-time-to-construct issues to look at, but > this definitely brings things to a better position. Well, PV is always going to be faster to construct, since you only need to allocate memory and create the initial page tables that cover the kernel, the metadata and optionally the initrd IIRC. On PVH we need to populate the full p2m, so I think it's safe to say that PVH build time is always going to be worse than PV. That doesn't mean we can't make it faster. I have to Since I also have an AMD box that I can play with, how much memory are you assigning to dom0? > > diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c > > index eded87eaf5..33520ec1bc 100644 > > --- a/xen/arch/x86/hvm/dom0_build.c > > +++ b/xen/arch/x86/hvm/dom0_build.c > > @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d) > > #undef MB1_PAGES > > } > > > > +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf, > > + size_t size) > > +{ > > + paddr_t kernel_start = (paddr_t)elf->dest_base; > > + paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size); > > + unsigned int i; > > + > > + for ( i = 0; i < d->arch.nr_e820; i++ ) > > + { > > + paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size; > > + > > + /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */ > > The BDA is in mfn 0 so is special for other reasons*. The EBDA and IBFT > are the problem datastructures. Sure. Sorry I haven't added it to the comment. > ~Andrew > > [*] Thinking about it, how should a PVH hardware domain reconcile its > paravirtualised boot with finding itself on a BIOS or EFI system... I guess the same applies to PV which also boots using a PV path but has access to firmware. I have to admit I never looked closely at how Linux does that, for FreeBSD it's fairly easy because at least when booting from BIOS the kernel won't try to make any calls into the BIOS, and instead expect the data it requires to be provided by the loader as part of the metadata blob, together with the modules &c. Thanks, Roger.
diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c index eded87eaf5..33520ec1bc 100644 --- a/xen/arch/x86/hvm/dom0_build.c +++ b/xen/arch/x86/hvm/dom0_build.c @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d) #undef MB1_PAGES } +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf, + size_t size) +{ + paddr_t kernel_start = (paddr_t)elf->dest_base; + paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size); + unsigned int i; + + for ( i = 0; i < d->arch.nr_e820; i++ ) + { + paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size; + + /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */ + if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM ) + continue; + + start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1)); + + if ( end <= kernel_start || start >= kernel_end ) + ; /* No overlap, nothing to do. */ + /* Deal with the kernel already being loaded in the region. */ + else if ( kernel_start <= start && kernel_end > start ) + /* Truncate the start of the region. */ + start = ROUNDUP(kernel_end, PAGE_SIZE); + else if ( kernel_start <= end && kernel_end > end ) + /* Truncate the end of the region. */ + end = kernel_start; + /* Pick the biggest of the split regions. */ + else if ( kernel_start - start > end - kernel_end ) + end = kernel_start; + else + start = ROUNDUP(kernel_end, PAGE_SIZE); + + if ( end - start >= size ) + return start; + } + + return INVALID_PADDR; +} + static int __init pvh_load_kernel(struct domain *d, const module_t *image, unsigned long image_headroom, module_t *initrd, void *image_base, @@ -546,7 +585,24 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image, return rc; } - last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE); + /* + * Find a RAM region big enough (and that doesn't overlap with the loaded + * kernel) in order to load the initrd and the metadata. Note it could be + * split into smaller allocations, done it as a single region in order to + * simplify it. + */ + last_addr = find_memory(d, &elf, sizeof(start_info) + + (initrd ? ROUNDUP(initrd->mod_end, PAGE_SIZE) + + sizeof(mod) + : 0) + + (cmdline ? ROUNDUP(strlen(cmdline) + 1, + elf_64bit(&elf) ? 8 : 4) + : 0)); + if ( last_addr == INVALID_PADDR ) + { + printk("Unable to find a memory region to load initrd and metadata\n"); + return -ENOMEM; + } if ( initrd != NULL ) {
Don't assume there's going to be enough space at the tail of the loaded kernel and instead try to find a suitable memory area where the initrd and metadata can be loaded. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- Changes since v1: - Calculate end of e820 entry earlier. - Only check if the end of the region is < 1MB. - Check for range overlaps with the kernel region. - Check the region is of type RAM. - Fix off-by-one checks in range overlaps. - Add a comment about why initrd and metadata is placed together. - Add parentheses around size calculations. --- xen/arch/x86/hvm/dom0_build.c | 58 ++++++++++++++++++++++++++++++++++- 1 file changed, 57 insertions(+), 1 deletion(-)