diff mbox

x86/pci: make pci_mem_start to be aligned only -v3

Message ID 49E52D3F.1090206@kernel.org (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Yinghai Lu April 15, 2009, 12:41 a.m. UTC
Impact: make more big space below 4g for assigning to unassigned pci devices

don't need to reserved one round after the gapstart.

v2: Linus said: "
	We've definitely seen ACPI code or integrated graphics stuff 
	that steals a lot of memory at the end, which means that end-of-RAM might 
	be not at 2GB, but at 2GB-16MB-1MB, for example (1MB of "ACPI data", and 
	16MB of "stolen video ram").

	At a minimum, if we do this, I'd like to make sure we round up to a big 
	boundary (eg 32MB or something - exactly because a missing 16MB can easily 
	be some integrated stolen video memory).

	Sure, we do that whole

		while ((gapsize >> 4) > round)
			round += round;

	thing, so that if the gap is large, then we'll certainly get to 32MB too, 
	but I think your patch matters the most exactly when the gap is small. 
	Maybe we could just raise the initial minimum rounding from 1MB to 32MB?
	...
	Alternatively, maybe we can make sure that we round up to at least X bytes
	from the end of RAM, and to at least Y bytes from the end of some RESERVED 
	thing."
v3: take pci_mem_start - low_top_ram bigger than half around, aka 16M at least


Reported-and-tested-by: Yannick <yannick.roehlly@free.fr>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/kernel/e820.c |   23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jesse Barnes April 16, 2009, 4:31 p.m. UTC | #1
On Tue, 14 Apr 2009 17:41:35 -0700
Yinghai Lu <yinghai@kernel.org> wrote:

> 
> Impact: make more big space below 4g for assigning to unassigned pci
> devices
> 
> don't need to reserved one round after the gapstart.
> 
> v2: Linus said: "
> 	We've definitely seen ACPI code or integrated graphics stuff 
> 	that steals a lot of memory at the end, which means that
> end-of-RAM might be not at 2GB, but at 2GB-16MB-1MB, for example (1MB
> of "ACPI data", and 16MB of "stolen video ram").
> 
> 	At a minimum, if we do this, I'd like to make sure we round
> up to a big boundary (eg 32MB or something - exactly because a
> missing 16MB can easily be some integrated stolen video memory).
> 
> 	Sure, we do that whole
> 
> 		while ((gapsize >> 4) > round)
> 			round += round;
> 
> 	thing, so that if the gap is large, then we'll certainly get
> to 32MB too, but I think your patch matters the most exactly when the
> gap is small. Maybe we could just raise the initial minimum rounding
> from 1MB to 32MB? ...
> 	Alternatively, maybe we can make sure that we round up to at
> least X bytes from the end of RAM, and to at least Y bytes from the
> end of some RESERVED thing."
> v3: take pci_mem_start - low_top_ram bigger than half around, aka 16M
> at least

Any comments on this one, Linus?  Should I include your ack?

Thanks,
Linus Torvalds April 16, 2009, 4:44 p.m. UTC | #2
On Thu, 16 Apr 2009, Jesse Barnes wrote:
> 
> Any comments on this one, Linus?  Should I include your ack?

I'm not ready to ack it, no. I don't think the suggested patch is very 
clean or necessarily sensible as-is. It seems very ad-hoc. 

I was literally thinking of something like 
 "round up from the last RAM by X"
 "round up from the last reserved region by Y"
 "pick the bigger of the two"

with helper functions for the two cases and comments along the lines of 
why we do it. Something that was a bit more obvious about what it's doing 
and why.

And no, I realize that the old code isn't that way. But the old code isn't 
the issue - the old code is proven over _years_ and years of testing, and 
works wonderfully well for a ton of very different machines. It has _one_ 
single known failure case, and while there clearly must be others, the 
point is, the old code is not what needs to be worried about. 

So when changing that code that has all that testing, and when the 
failures are so nasty and hard to debug and likely only happen on some 
random old laptop that has crap e820 tables and _just_ the right amount 
of memory, I'd really like the replacement code to be better.

			Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar April 16, 2009, 4:56 p.m. UTC | #3
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 16 Apr 2009, Jesse Barnes wrote:
> > 
> > Any comments on this one, Linus?  Should I include your ack?
> 
> I'm not ready to ack it, no. I don't think the suggested patch is very 
> clean or necessarily sensible as-is. It seems very ad-hoc. 
> 
> I was literally thinking of something like 
>  "round up from the last RAM by X"
>  "round up from the last reserved region by Y"
>  "pick the bigger of the two"
> 
> with helper functions for the two cases and comments along the 
> lines of why we do it. Something that was a bit more obvious about 
> what it's doing and why.

That's sensible - but i'd also like to inject hpa's add-on idea: if 
we do that then we should do it _explicitly_ and _visibly_, by 
injecting an artificial e820 reservation range to all expected 
"vulnerable" holes we cannot fully trust.

We'd do that after all the fixed resources are allocated, but before 
dynamic PCI allocations.

That prevents the PCI layer from dynamically allocating anything 
into that protective zone, and documents it as well (and makes it 
visible in boot logs, etc.) - instead of just a silent rule 
somewhere that no-one will really see if it breaks.

Or would this be a bad idea for some obvious reason i missed?

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu April 16, 2009, 5:18 p.m. UTC | #4
Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Thu, 16 Apr 2009, Jesse Barnes wrote:
>>> Any comments on this one, Linus?  Should I include your ack?
>> I'm not ready to ack it, no. I don't think the suggested patch is very 
>> clean or necessarily sensible as-is. It seems very ad-hoc. 
>>
>> I was literally thinking of something like 
>>  "round up from the last RAM by X"
>>  "round up from the last reserved region by Y"
>>  "pick the bigger of the two"
>>
>> with helper functions for the two cases and comments along the 
>> lines of why we do it. Something that was a bit more obvious about 
>> what it's doing and why.
> 
> That's sensible - but i'd also like to inject hpa's add-on idea: if 
> we do that then we should do it _explicitly_ and _visibly_, by 
> injecting an artificial e820 reservation range to all expected 
> "vulnerable" holes we cannot fully trust.
> 
> We'd do that after all the fixed resources are allocated, but before 
> dynamic PCI allocations.
> 
> That prevents the PCI layer from dynamically allocating anything 
> into that protective zone, and documents it as well (and makes it 
> visible in boot logs, etc.) - instead of just a silent rule 
> somewhere that no-one will really see if it breaks.

that need to do done much earlier, and much simple, just need to make that range to be reserved in e820.
and later e820_setup_gap even don't need to be aligned again.

YH
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
H. Peter Anvin April 16, 2009, 5:27 p.m. UTC | #5
Yinghai Lu wrote:
> 
> that need to do done much earlier, and much simple, just need to make that range to be reserved in e820.
> and later e820_setup_gap even don't need to be aligned again.
> 

As long as that doesn't cause the PCI layer to move devices already
assigned in this range out of it.  What we want is really a "weak
reserve".  On the other hand, that may very well be the semantics of the
existing reserved space, too (I honestly haven't looked lately.)

	-hpa
Ingo Molnar April 16, 2009, 5:28 p.m. UTC | #6
* Yinghai Lu <yinghai@kernel.org> wrote:

> Ingo Molnar wrote:
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > 
> >> On Thu, 16 Apr 2009, Jesse Barnes wrote:
> >>> Any comments on this one, Linus?  Should I include your ack?
> >> I'm not ready to ack it, no. I don't think the suggested patch is very 
> >> clean or necessarily sensible as-is. It seems very ad-hoc. 
> >>
> >> I was literally thinking of something like 
> >>  "round up from the last RAM by X"
> >>  "round up from the last reserved region by Y"
> >>  "pick the bigger of the two"
> >>
> >> with helper functions for the two cases and comments along the 
> >> lines of why we do it. Something that was a bit more obvious about 
> >> what it's doing and why.
> > 
> > That's sensible - but i'd also like to inject hpa's add-on idea: if 
> > we do that then we should do it _explicitly_ and _visibly_, by 
> > injecting an artificial e820 reservation range to all expected 
> > "vulnerable" holes we cannot fully trust.
> > 
> > We'd do that after all the fixed resources are allocated, but before 
> > dynamic PCI allocations.
> > 
> > That prevents the PCI layer from dynamically allocating anything 
> > into that protective zone, and documents it as well (and makes it 
> > visible in boot logs, etc.) - instead of just a silent rule 
> > somewhere that no-one will really see if it breaks.
> 
> that need to do done much earlier, and much simple, just need to 
> make that range to be reserved in e820. and later e820_setup_gap 
> even don't need to be aligned again.

Well, an alignment _check_ could still be added with a
WARN_ONCE(), to make sure these assumptions hold true in
future as well.

This kind of stuff is generally not testable and wont break on many 
systems - but it can easily cripple a random 0.5% of systems, 
creating a lot of unhappy users.

So pretty much the only solution is to be careful, robust and 
redundant all along.

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar April 16, 2009, 5:38 p.m. UTC | #7
* H. Peter Anvin <hpa@zytor.com> wrote:

> Yinghai Lu wrote:
> > 
> > that need to do done much earlier, and much simple, just need to make that range to be reserved in e820.
> > and later e820_setup_gap even don't need to be aligned again.
> > 
> 
> As long as that doesn't cause the PCI layer to move devices 
> already assigned in this range out of it.  What we want is really 
> a "weak reserve".  On the other hand, that may very well be the 
> semantics of the existing reserved space, too (I honestly haven't 
> looked lately.)

We have reserve_region_with_split(), which 'wraps around' existing 
resources non-intrusively by creating split-up resources - 
preventing their forced reallocation (and preventing their possible 
breakage - a number of BARs dont like dynamic reallocations at all).

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -619,6 +619,7 @@  __init void e820_setup_gap(void)
 {
 	unsigned long gapstart, gapsize, round;
 	int found;
+	unsigned long low_top_ram;
 
 	gapstart = 0x10000000;
 	gapsize = 0x400000;
@@ -636,14 +637,32 @@  __init void e820_setup_gap(void)
 
 	/*
 	 * See how much we want to round up: start off with
-	 * rounding to the next 1MB area.
+	 * rounding to the next 32MB area.
 	 */
-	round = 0x100000;
+	round = 0x2000000;
 	while ((gapsize >> 4) > round)
 		round += round;
+
+	pci_mem_start = roundup(gapstart, round);
+
+	low_top_ram = e820_end_of_low_ram_pfn() << PAGE_SHIFT;
+	/* check if there is gap between last RAM below 4g to that start */
+	if (pci_mem_start > low_top_ram) {
+		if (e820_any_mapped(low_top_ram, pci_mem_start, E820_RESERVED))
+			goto out;
+		if (e820_any_mapped(low_top_ram, pci_mem_start, E820_ACPI))
+			goto out;
+		if (e820_any_mapped(low_top_ram, pci_mem_start, E820_NVS))
+			goto out;
+
+		if ((pci_mem_start - low_top_ram) > (round>>1))
+			goto out;
+	}
+
 	/* Fun with two's complement */
 	pci_mem_start = (gapstart + round) & -round;
 
+out:
 	printk(KERN_INFO
 	       "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
 	       pci_mem_start, gapstart, gapsize);