Message ID | 49E52D3F.1090206@kernel.org (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
On Tue, 14 Apr 2009 17:41:35 -0700 Yinghai Lu <yinghai@kernel.org> wrote: > > Impact: make more big space below 4g for assigning to unassigned pci > devices > > don't need to reserved one round after the gapstart. > > v2: Linus said: " > We've definitely seen ACPI code or integrated graphics stuff > that steals a lot of memory at the end, which means that > end-of-RAM might be not at 2GB, but at 2GB-16MB-1MB, for example (1MB > of "ACPI data", and 16MB of "stolen video ram"). > > At a minimum, if we do this, I'd like to make sure we round > up to a big boundary (eg 32MB or something - exactly because a > missing 16MB can easily be some integrated stolen video memory). > > Sure, we do that whole > > while ((gapsize >> 4) > round) > round += round; > > thing, so that if the gap is large, then we'll certainly get > to 32MB too, but I think your patch matters the most exactly when the > gap is small. Maybe we could just raise the initial minimum rounding > from 1MB to 32MB? ... > Alternatively, maybe we can make sure that we round up to at > least X bytes from the end of RAM, and to at least Y bytes from the > end of some RESERVED thing." > v3: take pci_mem_start - low_top_ram bigger than half around, aka 16M > at least Any comments on this one, Linus? Should I include your ack? Thanks,
On Thu, 16 Apr 2009, Jesse Barnes wrote: > > Any comments on this one, Linus? Should I include your ack? I'm not ready to ack it, no. I don't think the suggested patch is very clean or necessarily sensible as-is. It seems very ad-hoc. I was literally thinking of something like "round up from the last RAM by X" "round up from the last reserved region by Y" "pick the bigger of the two" with helper functions for the two cases and comments along the lines of why we do it. Something that was a bit more obvious about what it's doing and why. And no, I realize that the old code isn't that way. But the old code isn't the issue - the old code is proven over _years_ and years of testing, and works wonderfully well for a ton of very different machines. It has _one_ single known failure case, and while there clearly must be others, the point is, the old code is not what needs to be worried about. So when changing that code that has all that testing, and when the failures are so nasty and hard to debug and likely only happen on some random old laptop that has crap e820 tables and _just_ the right amount of memory, I'd really like the replacement code to be better. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, 16 Apr 2009, Jesse Barnes wrote: > > > > Any comments on this one, Linus? Should I include your ack? > > I'm not ready to ack it, no. I don't think the suggested patch is very > clean or necessarily sensible as-is. It seems very ad-hoc. > > I was literally thinking of something like > "round up from the last RAM by X" > "round up from the last reserved region by Y" > "pick the bigger of the two" > > with helper functions for the two cases and comments along the > lines of why we do it. Something that was a bit more obvious about > what it's doing and why. That's sensible - but i'd also like to inject hpa's add-on idea: if we do that then we should do it _explicitly_ and _visibly_, by injecting an artificial e820 reservation range to all expected "vulnerable" holes we cannot fully trust. We'd do that after all the fixed resources are allocated, but before dynamic PCI allocations. That prevents the PCI layer from dynamically allocating anything into that protective zone, and documents it as well (and makes it visible in boot logs, etc.) - instead of just a silent rule somewhere that no-one will really see if it breaks. Or would this be a bad idea for some obvious reason i missed? Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ingo Molnar wrote: > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Thu, 16 Apr 2009, Jesse Barnes wrote: >>> Any comments on this one, Linus? Should I include your ack? >> I'm not ready to ack it, no. I don't think the suggested patch is very >> clean or necessarily sensible as-is. It seems very ad-hoc. >> >> I was literally thinking of something like >> "round up from the last RAM by X" >> "round up from the last reserved region by Y" >> "pick the bigger of the two" >> >> with helper functions for the two cases and comments along the >> lines of why we do it. Something that was a bit more obvious about >> what it's doing and why. > > That's sensible - but i'd also like to inject hpa's add-on idea: if > we do that then we should do it _explicitly_ and _visibly_, by > injecting an artificial e820 reservation range to all expected > "vulnerable" holes we cannot fully trust. > > We'd do that after all the fixed resources are allocated, but before > dynamic PCI allocations. > > That prevents the PCI layer from dynamically allocating anything > into that protective zone, and documents it as well (and makes it > visible in boot logs, etc.) - instead of just a silent rule > somewhere that no-one will really see if it breaks. that need to do done much earlier, and much simple, just need to make that range to be reserved in e820. and later e820_setup_gap even don't need to be aligned again. YH -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Yinghai Lu wrote: > > that need to do done much earlier, and much simple, just need to make that range to be reserved in e820. > and later e820_setup_gap even don't need to be aligned again. > As long as that doesn't cause the PCI layer to move devices already assigned in this range out of it. What we want is really a "weak reserve". On the other hand, that may very well be the semantics of the existing reserved space, too (I honestly haven't looked lately.) -hpa
* Yinghai Lu <yinghai@kernel.org> wrote: > Ingo Molnar wrote: > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > >> On Thu, 16 Apr 2009, Jesse Barnes wrote: > >>> Any comments on this one, Linus? Should I include your ack? > >> I'm not ready to ack it, no. I don't think the suggested patch is very > >> clean or necessarily sensible as-is. It seems very ad-hoc. > >> > >> I was literally thinking of something like > >> "round up from the last RAM by X" > >> "round up from the last reserved region by Y" > >> "pick the bigger of the two" > >> > >> with helper functions for the two cases and comments along the > >> lines of why we do it. Something that was a bit more obvious about > >> what it's doing and why. > > > > That's sensible - but i'd also like to inject hpa's add-on idea: if > > we do that then we should do it _explicitly_ and _visibly_, by > > injecting an artificial e820 reservation range to all expected > > "vulnerable" holes we cannot fully trust. > > > > We'd do that after all the fixed resources are allocated, but before > > dynamic PCI allocations. > > > > That prevents the PCI layer from dynamically allocating anything > > into that protective zone, and documents it as well (and makes it > > visible in boot logs, etc.) - instead of just a silent rule > > somewhere that no-one will really see if it breaks. > > that need to do done much earlier, and much simple, just need to > make that range to be reserved in e820. and later e820_setup_gap > even don't need to be aligned again. Well, an alignment _check_ could still be added with a WARN_ONCE(), to make sure these assumptions hold true in future as well. This kind of stuff is generally not testable and wont break on many systems - but it can easily cripple a random 0.5% of systems, creating a lot of unhappy users. So pretty much the only solution is to be careful, robust and redundant all along. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* H. Peter Anvin <hpa@zytor.com> wrote: > Yinghai Lu wrote: > > > > that need to do done much earlier, and much simple, just need to make that range to be reserved in e820. > > and later e820_setup_gap even don't need to be aligned again. > > > > As long as that doesn't cause the PCI layer to move devices > already assigned in this range out of it. What we want is really > a "weak reserve". On the other hand, that may very well be the > semantics of the existing reserved space, too (I honestly haven't > looked lately.) We have reserve_region_with_split(), which 'wraps around' existing resources non-intrusively by creating split-up resources - preventing their forced reallocation (and preventing their possible breakage - a number of BARs dont like dynamic reallocations at all). Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Index: linux-2.6/arch/x86/kernel/e820.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/e820.c +++ linux-2.6/arch/x86/kernel/e820.c @@ -619,6 +619,7 @@ __init void e820_setup_gap(void) { unsigned long gapstart, gapsize, round; int found; + unsigned long low_top_ram; gapstart = 0x10000000; gapsize = 0x400000; @@ -636,14 +637,32 @@ __init void e820_setup_gap(void) /* * See how much we want to round up: start off with - * rounding to the next 1MB area. + * rounding to the next 32MB area. */ - round = 0x100000; + round = 0x2000000; while ((gapsize >> 4) > round) round += round; + + pci_mem_start = roundup(gapstart, round); + + low_top_ram = e820_end_of_low_ram_pfn() << PAGE_SHIFT; + /* check if there is gap between last RAM below 4g to that start */ + if (pci_mem_start > low_top_ram) { + if (e820_any_mapped(low_top_ram, pci_mem_start, E820_RESERVED)) + goto out; + if (e820_any_mapped(low_top_ram, pci_mem_start, E820_ACPI)) + goto out; + if (e820_any_mapped(low_top_ram, pci_mem_start, E820_NVS)) + goto out; + + if ((pci_mem_start - low_top_ram) > (round>>1)) + goto out; + } + /* Fun with two's complement */ pci_mem_start = (gapstart + round) & -round; +out: printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n", pci_mem_start, gapstart, gapsize);
Impact: make more big space below 4g for assigning to unassigned pci devices don't need to reserved one round after the gapstart. v2: Linus said: " We've definitely seen ACPI code or integrated graphics stuff that steals a lot of memory at the end, which means that end-of-RAM might be not at 2GB, but at 2GB-16MB-1MB, for example (1MB of "ACPI data", and 16MB of "stolen video ram"). At a minimum, if we do this, I'd like to make sure we round up to a big boundary (eg 32MB or something - exactly because a missing 16MB can easily be some integrated stolen video memory). Sure, we do that whole while ((gapsize >> 4) > round) round += round; thing, so that if the gap is large, then we'll certainly get to 32MB too, but I think your patch matters the most exactly when the gap is small. Maybe we could just raise the initial minimum rounding from 1MB to 32MB? ... Alternatively, maybe we can make sure that we round up to at least X bytes from the end of RAM, and to at least Y bytes from the end of some RESERVED thing." v3: take pci_mem_start - low_top_ram bigger than half around, aka 16M at least Reported-and-tested-by: Yannick <yannick.roehlly@free.fr> Signed-off-by: Yinghai Lu <yinghai@kernel.org> --- arch/x86/kernel/e820.c | 23 +++++++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html