Message ID | 551BFFD2.20608@plexistor.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Apr 01, 2015 at 05:25:22PM +0300, Boaz Harrosh wrote: > pfn = PFN_DOWN(ei->addr + ei->size); > > - switch (ei->type) { > - case E820_RAM: > - case E820_PRAM: > - case E820_RESERVED_KERN: > - break; > - default: > + if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN) > register_nosave_region(PFN_UP(ei->addr), pfn); > - } I guess this makes sense - if the content is persistent already we don't need to save it. > - if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) { > - if (e820.map[i].type != E820_PRAM) > - res->flags |= IORESOURCE_BUSY; > + if (((e820.map[i].type != E820_RESERVED) && > + (e820.map[i].type != E820_PRAM)) || > + res->start < (1ULL<<20)) { So now we also trigger for PRAM regions under 1ULL<<20, was that the intentional change? Honestly I don't really understand this 1ULL<<20 magic here even for the existing case. Guess this is magic from the old ISA PC days? > + res->flags |= IORESOURCE_BUSY; Guess this is the real change, and I'd love to understand why this makes a difference for you. IORESOURCE_BUSY is checked almost never, and is intented to mean it's a driver mapping.
* Christoph Hellwig <hch@lst.de> wrote: > On Wed, Apr 01, 2015 at 05:25:22PM +0300, Boaz Harrosh wrote: > > pfn = PFN_DOWN(ei->addr + ei->size); > > > > - switch (ei->type) { > > - case E820_RAM: > > - case E820_PRAM: > > - case E820_RESERVED_KERN: > > - break; > > - default: > > + if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN) > > register_nosave_region(PFN_UP(ei->addr), pfn); > > - } > > I guess this makes sense - if the content is persistent already we don't need > to save it. > > > - if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) { > > - if (e820.map[i].type != E820_PRAM) > > - res->flags |= IORESOURCE_BUSY; > > + if (((e820.map[i].type != E820_RESERVED) && > > + (e820.map[i].type != E820_PRAM)) || > > + res->start < (1ULL<<20)) { > > So now we also trigger for PRAM regions under 1ULL<<20, was that the > intentional change? Honestly I don't really understand this 1ULL<<20 > magic here even for the existing case. Guess this is magic from the > old ISA PC days? > > > + res->flags |= IORESOURCE_BUSY; > > Guess this is the real change, and I'd love to understand why this > makes a difference for you. IORESOURCE_BUSY is checked almost > never, and is intented to mean it's a driver mapping. So assuming this works on your test setup I'm inclined to squash Boaz's fixes into the original patch, assuming you see no outright bug in them. Anything else can be done as delta improvements. Agreed? Thanks, Ingo
On Thu, Apr 02, 2015 at 11:37:40AM +0200, Ingo Molnar wrote: > So assuming this works on your test setup I'm inclined to squash > Boaz's fixes into the original patch, assuming you see no outright bug > in them. Anything else can be done as delta improvements. It looks sensible, but I'd really like to understand the changes a bit better. In the meantime I'll test them on my setup.
On Thu, Apr 02, 2015 at 11:37:40AM +0200, Ingo Molnar wrote:
> So assuming this works on your test setup
Boaz's changes work fine here.
On 04/02/2015 12:30 PM, Christoph Hellwig wrote: > On Wed, Apr 01, 2015 at 05:25:22PM +0300, Boaz Harrosh wrote: >> pfn = PFN_DOWN(ei->addr + ei->size); >> >> - switch (ei->type) { >> - case E820_RAM: >> - case E820_PRAM: >> - case E820_RESERVED_KERN: >> - break; >> - default: >> + if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN) >> register_nosave_region(PFN_UP(ei->addr), pfn); >> - } > > I guess this makes sense - if the content is persistent already we don't need > to save it. > Yes >> - if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) { >> - if (e820.map[i].type != E820_PRAM) >> - res->flags |= IORESOURCE_BUSY; >> + if (((e820.map[i].type != E820_RESERVED) && >> + (e820.map[i].type != E820_PRAM)) || >> + res->start < (1ULL<<20)) { > > So now we also trigger for PRAM regions under 1ULL<<20, was that the > intentional change? Honestly I don't really understand this 1ULL<<20 > magic here even for the existing case. Guess this is magic from the > old ISA PC days? > OK so by now I have had a chance to test all cases and figure out what happened. I was running on your V1 and then V2. And had huge problems with NUMA. The thing that actually fixed everything and brought the system back to life, was already in your V3. It was the remove of reserve_pmem() and the call to memblock_reserve() and init_memory_mapping() from within. It was making a mess. So your V3 was already running nice. But I already fixed everything on my side by the time you sent V3, and what I sent here is the diff from what I had and your V3. But these are all good fixes still. (Though not fatal as V2 was) 3 small fixes here: * Adding back the memmap=! now that everything works perfectly as expected. * The one above that fixes register_nosave_region and ... >> + res->flags |= IORESOURCE_BUSY; > > Guess this is the real change, and I'd love to understand why this > makes a difference for you. IORESOURCE_BUSY is checked almost never, > and is intented to mean it's a driver mapping. This here is a very minor thing. I have lots of experience with this code and with the internals of insert_resource() from my old e820.c fixes (Which are BTW still relevant but no longer needed for any current platform) So what happens here is just the adding of the IORESOURCE_BUSY flag. Actually all these resources are already in the resource list and this code is just changing the flag. So if you are not changing the flag there is just no point in calling insert_resource(). It will change nothing. From what I understand from the history of this file and from prints that I put in this file and at the resource manager. The logic is that it wants to make all E820_XXX busy so to not let any driver try and claim them. And Also the code does not want to allow any kind of e820 resource below the 1M be allowed for drivers, reserved or otherwise. Any E820_RESERVED regions it allows for driver use by not setting IORESOURCE_BUSY. Your code and mine are effectively the same only that I optimize out the call to insert_resource() since the flags were not changed. Testing status: We had some birth problems with the new system and the APIs that changed so the testing rig broke half through the night. But we have wrinkled out the last problems and the machines are pumping away full steam, NUMA and everything. So far it looks good. I will very soon now, Send you a tested-by: That confirms these patches are just as good as what we had until now out-of-tree. (I'm running with my page-struct patches on top of your pmem driver. Will publish trees soon) Thank you for your patience Boaz
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index bfcb1a6..c87122d 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1965,6 +1965,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. or memmap=0x10000$0x18690000 + memmap=nn[KMG]!ss[KMG] + [KNL,X86] Mark specific memory as protected. + Region of memory to be used, from ss to ss+nn. + The memory region may be marked as e820 type 12 (0xc) + and is NVDIMM or ADR memory. + memory_corruption_check=0/1 [X86] Some BIOSes seem to corrupt the first 64k of memory when doing things like suspend/resume. diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index e26ca56..3572a22 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -346,7 +346,7 @@ int __init sanitize_e820_map(struct e820entry *biosmap, int max_nr_map, * continue building up new bios map based on this * information */ - if (current_type != last_type) { + if (current_type != last_type || current_type == E820_PRAM) { if (last_type != 0) { new_bios[new_bios_entry].size = change_point[chgidx]->addr - last_addr; @@ -692,14 +692,8 @@ void __init e820_mark_nosave_regions(unsigned long limit_pfn) pfn = PFN_DOWN(ei->addr + ei->size); - switch (ei->type) { - case E820_RAM: - case E820_PRAM: - case E820_RESERVED_KERN: - break; - default: + if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN) register_nosave_region(PFN_UP(ei->addr), pfn); - } if (pfn >= limit_pfn) break; @@ -880,6 +874,9 @@ static int __init parse_memmap_one(char *p) } else if (*p == '$') { start_at = memparse(p+1, &p); e820_add_region(start_at, mem_size, E820_RESERVED); + } else if (*p == '!') { + start_at = memparse(p+1, &p); + e820_add_region(start_at, mem_size, E820_PRAM); } else e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1); @@ -955,9 +952,10 @@ void __init e820_reserve_resources(void) * pci device BAR resource and insert them later in * pcibios_resource_survey() */ - if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) { - if (e820.map[i].type != E820_PRAM) - res->flags |= IORESOURCE_BUSY; + if (((e820.map[i].type != E820_RESERVED) && + (e820.map[i].type != E820_PRAM)) || + res->start < (1ULL<<20)) { + res->flags |= IORESOURCE_BUSY; insert_resource(&iomem_resource, res); } res++;
Finally with these fixes I'm able to define memmap=! regions in NUMA machines. Any combination cross or not cross NUMA boundary. And not only the memmap=! regions had problems also the real type-12 NvDIMMs had the same NUMA problems. Now it all works. Also I have kept the "don't merge PRAM" regions for ease of emulated NUMA setups. Also I have reverted the change Ch did to e820_mark_nosave_regions. From comment above the function and if I'm reading register_nosave_region() correctly, We certainly do not want the system to try and save any pmem to swap or hibernate. (Actually it will be the opposite right). Can we actually define swap on a /dev/pmemX ? ;-) Signed-off-by: Boaz Harrosh <boaz@plexistor.com> --- Documentation/kernel-parameters.txt | 6 ++++++ arch/x86/kernel/e820.c | 20 +++++++++----------- 2 files changed, 15 insertions(+), 11 deletions(-)