Message ID | 20190820075128.2912224-1-songliubraving@fb.com (mailing list archive) |
---|---|
State | New, archived |
Series | x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE |
On Tue, 20 Aug 2019, Song Liu wrote:

> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
> This is not accurate because addr may not be PUD_SIZE aligned.

You fail to explain how this happened. The code before the 32bit support
did always increase by PMD_SIZE. The 32bit support broke that.

> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issue, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.

This information is largely irrelevant. What matters is the fact that this
got broken and incorrectly forwards the address by PUD_SIZE, which is wrong
if the address is not PUD_SIZE aligned.

> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
> case.

git grep 'This patch' Documentation/process/submitting-patches.rst

> Cc: stable@vger.kernel.org # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Cc: Joerg Roedel <jroedel@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> ---
>  arch/x86/mm/pti.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..5a67c3015f59 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> -			addr += PUD_SIZE;
> +			addr += PMD_SIZE;

The right fix is to skip forward to the next PUD boundary instead of doing
this in a loop with PMD_SIZE increments.

Thanks,

        tglx
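To make the breakage concrete, here is a minimal user-space sketch of the
stepping arithmetic — not the kernel code; it assumes x86_64's 2 MB
PMD_SIZE and 1 GB PUD_SIZE, and borrows the kernel-text range from the
page-table dumps later in the thread:

/*
 * User-space sketch, not kernel code. Shows why "addr += PUD_SIZE"
 * from a non-PUD-aligned address overshoots the whole kernel-text
 * range in one step, while the pre-32-bit-support PMD_SIZE stepping
 * visits all 7 PMDs in the range.
 */
#include <stdio.h>

#define PMD_SIZE        (1UL << 21)     /* 2 MB on x86_64 */
#define PUD_SIZE        (1UL << 30)     /* 1 GB on x86_64 */

int main(void)
{
        unsigned long start = 0xffffffff81000000UL;    /* not PUD aligned */
        unsigned long end   = 0xffffffff81e00000UL;
        unsigned long addr;
        int pmds = 0;

        /* the broken step: one jump clears the entire range */
        printf("PUD_SIZE step: %#lx -> %#lx (past end %#lx)\n",
               start, start + PUD_SIZE, end);

        /* the old behaviour: walk PMD by PMD */
        for (addr = start; addr < end; addr += PMD_SIZE)
                pmds++;
        printf("PMD_SIZE steps visit %d PMDs\n", pmds); /* prints 7 */

        return 0;
}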
On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:

> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
> This is not accurate because addr may not be PUD_SIZE aligned.
>
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issue, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.
>
> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
> case.

> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..5a67c3015f59 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> -			addr += PUD_SIZE;
> +			addr += PMD_SIZE;
>  			continue;
>  		}

I'm thinking you're right in that there's a bug here, but I'm also
thinking your patch is both incomplete and broken.

What that code wants to do is skip to the end of the pud; a PMD_SIZE
increase will not do that. And right below this, there's a second
instance of this exact pattern.

Did I get the below right?

---
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
 
 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
+			addr &= PUD_MASK;
 			addr += PUD_SIZE;
 			continue;
 		}
 
 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
+			addr &= PMD_MASK;
 			addr += PMD_SIZE;
 			continue;
 		}
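For what it's worth, the mask-then-add pattern above does land on the next
PUD boundary whether or not addr is already aligned. A small user-space
check (assumed x86_64 sizes; the addresses are arbitrary examples):

#include <stdio.h>

#define PUD_SIZE        (1UL << 30)
#define PUD_MASK        (~(PUD_SIZE - 1))

static unsigned long skip_pud(unsigned long addr)
{
        addr &= PUD_MASK;       /* round down to the PUD base ... */
        addr += PUD_SIZE;       /* ... then step to the next boundary */
        return addr;
}

int main(void)
{
        /* unaligned: 0x7ff00000 -> 0x80000000 (next boundary) */
        printf("%#lx -> %#lx\n", 0x7ff00000UL, skip_pud(0x7ff00000UL));
        /* already aligned: 0x40000000 -> 0x80000000 (still advances) */
        printf("%#lx -> %#lx\n", 0x40000000UL, skip_pud(0x40000000UL));
        return 0;
}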
On Tue, 20 Aug 2019, Peter Zijlstra wrote:

> What that code wants to do is skip to the end of the pud; a PMD_SIZE
> increase will not do that. And right below this, there's a second
> instance of this exact pattern.
>
> Did I get the below right?
>
> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..32b20b3cb227 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> +			addr &= PUD_MASK;
>  			addr += PUD_SIZE;

  round_up(addr, PUD_SIZE);

perhaps?

>  			continue;
>  		}
>
>  		pmd = pmd_offset(pud, addr);
>  		if (pmd_none(*pmd)) {
> +			addr &= PMD_MASK;
>  			addr += PMD_SIZE;
>  			continue;
>  		}
> On Aug 20, 2019, at 2:12 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, 20 Aug 2019, Song Liu wrote:
>
>> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
>> This is not accurate because addr may not be PUD_SIZE aligned.
>
> You fail to explain how this happened. The code before the 32bit support
> did always increase by PMD_SIZE. The 32bit support broke that.

Will fix.

>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issue, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>
> This information is largely irrelevant. What matters is the fact that this
> got broken and incorrectly forwards the address by PUD_SIZE, which is wrong
> if the address is not PUD_SIZE aligned.

We started looking into this because we could not explain the regression
in iTLB miss rate. I guess the patch itself explains it pretty well, so
the original issue doesn't matter that much. I will remove this part.

>> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
>> case.
>
> git grep 'This patch' Documentation/process/submitting-patches.rst

Will fix.

>> Cc: stable@vger.kernel.org # v4.19+
>> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> Cc: Joerg Roedel <jroedel@suse.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> ---
>>  arch/x86/mm/pti.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..5a67c3015f59 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>>  		pud = pud_offset(p4d, addr);
>>  		if (pud_none(*pud)) {
>> -			addr += PUD_SIZE;
>> +			addr += PMD_SIZE;
>
> The right fix is to skip forward to the next PUD boundary instead of doing
> this in a loop with PMD_SIZE increments.

Agreed.

Thanks,
Song
> On Aug 20, 2019, at 3:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:
>> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
>> This is not accurate because addr may not be PUD_SIZE aligned.
>>
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issue, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>>
>> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
>> case.
>
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..5a67c3015f59 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>>  		pud = pud_offset(p4d, addr);
>>  		if (pud_none(*pud)) {
>> -			addr += PUD_SIZE;
>> +			addr += PMD_SIZE;
>>  			continue;
>>  		}
>
> I'm thinking you're right in that there's a bug here, but I'm also
> thinking your patch is both incomplete and broken.
>
> What that code wants to do is skip to the end of the pud; a PMD_SIZE
> increase will not do that. And right below this, there's a second
> instance of this exact pattern.
>
> Did I get the below right?

Yes, you are right.

Thanks,
Song

>
> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..32b20b3cb227 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> +			addr &= PUD_MASK;
>  			addr += PUD_SIZE;
>  			continue;
>  		}
>
>  		pmd = pmd_offset(pud, addr);
>  		if (pmd_none(*pmd)) {
> +			addr &= PMD_MASK;
>  			addr += PMD_SIZE;
>  			continue;
>  		}
> On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
>> What that code wants to do is skip to the end of the pud; a PMD_SIZE
>> increase will not do that. And right below this, there's a second
>> instance of this exact pattern.
>>
>> Did I get the below right?
>>
>> ---
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..32b20b3cb227 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>>  		pud = pud_offset(p4d, addr);
>>  		if (pud_none(*pud)) {
>> +			addr &= PUD_MASK;
>>  			addr += PUD_SIZE;
>
> round_up(addr, PUD_SIZE);

I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".

Thanks,
Song

>
> perhaps?
>
>>  			continue;
>>  		}
>>
>>  		pmd = pmd_offset(pud, addr);
>>  		if (pmd_none(*pmd)) {
>> +			addr &= PMD_MASK;
>>  			addr += PMD_SIZE;
>>  			continue;
>>  		}
On Tue, 2019-08-20 at 13:16 +0200, Thomas Gleixner wrote:
> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> > What that code wants to do is skip to the end of the pud; a
> > PMD_SIZE increase will not do that. And right below this, there's
> > a second instance of this exact pattern.
> >
> > Did I get the below right?
> >
> > ---
> > diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> > index b196524759ec..32b20b3cb227 100644
> > --- a/arch/x86/mm/pti.c
> > +++ b/arch/x86/mm/pti.c
> > @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> >
> >  		pud = pud_offset(p4d, addr);
> >  		if (pud_none(*pud)) {
> > +			addr &= PUD_MASK;
> >  			addr += PUD_SIZE;
>
> round_up(addr, PUD_SIZE);
>
> perhaps?

Won't that keep returning the same address forever if addr is already
PUD_SIZE aligned?
On Tue, 20 Aug 2019, Song Liu wrote:
> > On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> >> What that code wants to do is skip to the end of the pud; a PMD_SIZE
> >> increase will not do that. And right below this, there's a second
> >> instance of this exact pattern.
> >>
> >> Did I get the below right?
> >>
> >> ---
> >> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> >> index b196524759ec..32b20b3cb227 100644
> >> --- a/arch/x86/mm/pti.c
> >> +++ b/arch/x86/mm/pti.c
> >> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> >>
> >>  		pud = pud_offset(p4d, addr);
> >>  		if (pud_none(*pud)) {
> >> +			addr &= PUD_MASK;
> >>  			addr += PUD_SIZE;
> >
> > round_up(addr, PUD_SIZE);
>
> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".

Right you are.
On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
> > On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de>
> > wrote:
> >
> > On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> > > What that code wants to do is skip to the end of the pud; a
> > > PMD_SIZE increase will not do that. And right below this, there's
> > > a second instance of this exact pattern.
> > >
> > > Did I get the below right?
> > >
> > > ---
> > > diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> > > index b196524759ec..32b20b3cb227 100644
> > > --- a/arch/x86/mm/pti.c
> > > +++ b/arch/x86/mm/pti.c
> > > @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> > >
> > >  		pud = pud_offset(p4d, addr);
> > >  		if (pud_none(*pud)) {
> > > +			addr &= PUD_MASK;
> > >  			addr += PUD_SIZE;
> >
> > round_up(addr, PUD_SIZE);
>
> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".

What does that do if start is less than PMD_SIZE away from the next
PUD_SIZE boundary?

How about: round_up(addr + 1, PUD_SIZE)?
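The difference between the three candidates is easiest to see numerically.
A user-space sketch, assuming the kernel's power-of-two round_up()
semantics (re-defined locally here) and x86_64 sizes; the addresses are
arbitrary examples:

#include <stdio.h>

#define PMD_SIZE        (1UL << 21)
#define PUD_SIZE        (1UL << 30)
/* the kernel's round_up() for power-of-two alignment */
#define round_up(x, y)  ((((x) - 1) | ((y) - 1)) + 1)

static void show(unsigned long addr)
{
        printf("addr %#lx:\n", addr);
        printf("  round_up(addr,            PUD_SIZE) = %#lx\n",
               round_up(addr, PUD_SIZE));
        printf("  round_up(addr + PMD_SIZE, PUD_SIZE) = %#lx\n",
               round_up(addr + PMD_SIZE, PUD_SIZE));
        printf("  round_up(addr + 1,        PUD_SIZE) = %#lx\n",
               round_up(addr + 1, PUD_SIZE));
}

int main(void)
{
        show(0x40000000UL);  /* PUD aligned: round_up(addr) returns addr -> endless loop */
        show(0x7fe00000UL);  /* exactly PMD_SIZE below the boundary: all three agree */
        show(0x7ff00000UL);  /* closer than PMD_SIZE: +PMD_SIZE skips a whole PUD */
        return 0;
}

Only round_up(addr + 1, PUD_SIZE) advances in every case without ever
jumping past the next boundary.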
On 8/20/19 12:51 AM, Song Liu wrote:
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issue, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.

I was surprised that this manifests as a performance issue. Usually
messing up PTI page table manipulation means you get to experience the
joys of debugging triple faults. But, it makes sense if it's this line:

        /*
         * Note that this will undo _some_ of the work that
         * pti_set_kernel_image_nonglobal() did to clear the
         * global bit.
         */
        pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);

which is restoring the Global bit.

*But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
and shouldn't have a global kernel image. Could you confirm whether
PCIDs are supported on this CPU?

>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> -			addr += PUD_SIZE;
> +			addr += PMD_SIZE;
>  			continue;
>  		}

Did we also bugger up this code:

        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd)) {
                addr += PMD_SIZE;
                continue;
        }

if we're on 32-bit and this:

        #define PTI_LEVEL_KERNEL_IMAGE  PTI_CLONE_PTE

and we get a hole walking to a non-PMD-aligned address?
> On Aug 20, 2019, at 6:55 AM, Rik van Riel <riel@fb.com> wrote:
>
> On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
>>> On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de>
>>> wrote:
>>>
>>> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
>>>> What that code wants to do is skip to the end of the pud; a
>>>> PMD_SIZE increase will not do that. And right below this, there's
>>>> a second instance of this exact pattern.
>>>>
>>>> Did I get the below right?
>>>>
>>>> ---
>>>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>>>> index b196524759ec..32b20b3cb227 100644
>>>> --- a/arch/x86/mm/pti.c
>>>> +++ b/arch/x86/mm/pti.c
>>>> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>>>
>>>>  		pud = pud_offset(p4d, addr);
>>>>  		if (pud_none(*pud)) {
>>>> +			addr &= PUD_MASK;
>>>>  			addr += PUD_SIZE;
>>>
>>> round_up(addr, PUD_SIZE);
>>
>> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
>
> What does that do if start is less than PMD_SIZE away from the next
> PUD_SIZE boundary?

Great point!

> How about: round_up(addr + 1, PUD_SIZE)?

Yes. How about this?

=========================== 8< ============================

From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Mon, 19 Aug 2019 23:59:47 -0700
Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr properly

Before the 32-bit support was added, pti_clone_pmds() always advanced
addr by PMD_SIZE. This behavior changed with the 32-bit support:
pti_clone_pgtable() increases addr by PUD_SIZE for the pud_none(*pud)
case, and by PMD_SIZE for the pmd_none(*pmd) case. However, this is not
accurate because addr may not be PUD_SIZE/PMD_SIZE aligned.

Fix this issue by rounding addr up to the next PUD_SIZE/PMD_SIZE
boundary in these two cases.

Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/mm/pti.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..1337494e22ef 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
 
 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			addr = round_up(addr + 1, PUD_SIZE);
 			continue;
 		}
 
 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
-			addr += PMD_SIZE;
+			addr = round_up(addr + 1, PMD_SIZE);
 			continue;
 		}
-- 
2.17.1
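As a sanity check of the stepping arithmetic only (a user-space sketch
again, with the same assumed sizes and local round_up() definition; it
models the address progression, not the page-table walk): with
round_up(addr + 1, ...) the loop neither stalls on an aligned address
nor skips past mapped PMDs, and over the 14 MB kernel-text range from
the dumps later in the thread it takes exactly 7 PMD-sized steps:

#include <stdio.h>

#define PMD_SIZE        (1UL << 21)
#define round_up(x, y)  ((((x) - 1) | ((y) - 1)) + 1)

int main(void)
{
        /* kernel-text range from the page-table dumps in this thread */
        unsigned long addr = 0xffffffff81000000UL;
        unsigned long end  = 0xffffffff81e00000UL;
        int steps = 0;

        while (addr < end) {
                steps++;        /* one PMD handled per iteration */
                addr = round_up(addr + 1, PMD_SIZE);
        }
        printf("%d PMD steps\n", steps);        /* prints 7 */
        return 0;
}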
> On Aug 20, 2019, at 6:57 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 8/20/19 12:51 AM, Song Liu wrote:
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issue, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>
> I was surprised that this manifests as a performance issue. Usually
> messing up PTI page table manipulation means you get to experience the
> joys of debugging triple faults. But, it makes sense if it's this line:
>
>         /*
>          * Note that this will undo _some_ of the work that
>          * pti_set_kernel_image_nonglobal() did to clear the
>          * global bit.
>          */
>         pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
>
> which is restoring the Global bit.
>
> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
> and shouldn't have a global kernel image. Could you confirm whether
> PCIDs are supported on this CPU?

Yes, pcid is listed in /proc/cpuinfo.

Thanks,
Song
On 8/20/19 7:14 AM, Song Liu wrote:
>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>> and shouldn't have a global kernel image. Could you confirm whether
>> PCIDs are supported on this CPU?
> Yes, pcid is listed in /proc/cpuinfo.

So what's going on? Could you confirm exactly which pti_clone_pgtable()
is causing you problems? Do you have a theory as to why this manifests
as a performance problem rather than a functional one?

A diff of these:

        /sys/kernel/debug/page_tables/current_user
        /sys/kernel/debug/page_tables/current_kernel

before and after your patch might be helpful.
> On Aug 20, 2019, at 7:18 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 8/20/19 7:14 AM, Song Liu wrote:
>>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>>> and shouldn't have a global kernel image. Could you confirm whether
>>> PCIDs are supported on this CPU?
>> Yes, pcid is listed in /proc/cpuinfo.
>
> So what's going on? Could you confirm exactly which pti_clone_pgtable()
> is causing you problems? Do you have a theory as to why this manifests
> as a performance problem rather than a functional one?
>
> A diff of these:
>
>         /sys/kernel/debug/page_tables/current_user
>         /sys/kernel/debug/page_tables/current_kernel
>
> before and after your patch might be helpful.

I believe the difference is from the following entries (7 PMDs).

Before the patch:

current_kernel: 0xffffffff81000000-0xffffffff81e04000  14352K  ro      GLB x  pte
efi:            0xffffffff81000000-0xffffffff81e04000  14352K  ro      GLB x  pte
kernel:         0xffffffff81000000-0xffffffff81e04000  14352K  ro      GLB x  pte

After the patch:

current_kernel: 0xffffffff81000000-0xffffffff81e00000  14M     ro  PSE GLB x  pmd
efi:            0xffffffff81000000-0xffffffff81e00000  14M     ro  PSE GLB x  pmd
kernel:         0xffffffff81000000-0xffffffff81e00000  14M     ro  PSE GLB x  pmd

current_kernel and kernel show the same data, though.

Thanks,
Song
> On Aug 20, 2019, at 9:05 AM, Song Liu <songliubraving@fb.com> wrote:
>
>> On Aug 20, 2019, at 7:18 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> So what's going on? Could you confirm exactly which pti_clone_pgtable()
>> is causing you problems? Do you have a theory as to why this manifests
>> as a performance problem rather than a functional one?
>>
>> A diff of these:
>>
>>         /sys/kernel/debug/page_tables/current_user
>>         /sys/kernel/debug/page_tables/current_kernel
>>
>> before and after your patch might be helpful.
>
> I believe the difference is from the following entries (7 PMDs).
>
> Before the patch:
>
> current_kernel: 0xffffffff81000000-0xffffffff81e04000  14352K  ro      GLB x  pte
> efi:            0xffffffff81000000-0xffffffff81e04000  14352K  ro      GLB x  pte
> kernel:         0xffffffff81000000-0xffffffff81e04000  14352K  ro      GLB x  pte
>
> After the patch:
>
> current_kernel: 0xffffffff81000000-0xffffffff81e00000  14M     ro  PSE GLB x  pmd
> efi:            0xffffffff81000000-0xffffffff81e00000  14M     ro  PSE GLB x  pmd
> kernel:         0xffffffff81000000-0xffffffff81e00000  14M     ro  PSE GLB x  pmd
>
> current_kernel and kernel show the same data, though.

A little more detail on how I got here.

We use huge pages for hot text, which reduces iTLB misses. As we
benchmarked a 5.2 based kernel (vs. 4.16 based), we found ~2.5x more
iTLB misses. To figure out the issue, I used a debug patch that dumps
the page table for a pid. The following is information from the
workload pid.

For the 4.16 based kernel:

host-4.16 # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000   8M  USR ro  PSE     x  pmd
0xffffffff81a00000-0xffffffff81c00000   2M      ro  PSE     x  pmd

For the 5.2 based kernel before this patch:

host-5.2-before # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000   8M  USR ro  PSE     x  pmd

The 8MB of text in PMDs is from user space. The 4.16 kernel has 1 PMD
for the irq entry table, while the 5.2 kernel (before this patch)
doesn't have it.

For the 5.2 based kernel after this patch:

host-5.2-after # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000   8M  USR ro  PSE     x  pmd
0xffffffff81000000-0xffffffff81e00000  14M      ro  PSE GLB x  pmd

So after this patch, the 5.2 based kernel has 7 kernel-text PMDs instead
of the 1 PMD in the 4.16 kernel. This further reduces the iTLB miss rate.

Thanks,
Song
On Tue, 2019-08-20 at 10:00 -0400, Song Liu wrote:
>
> From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00 2001
> From: Song Liu <songliubraving@fb.com>
> Date: Mon, 19 Aug 2019 23:59:47 -0700
> Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr
>  properly
>
> Before the 32-bit support was added, pti_clone_pmds() always advanced
> addr by PMD_SIZE. This behavior changed with the 32-bit support:
> pti_clone_pgtable() increases addr by PUD_SIZE for the pud_none(*pud)
> case, and by PMD_SIZE for the pmd_none(*pmd) case. However, this is
> not accurate because addr may not be PUD_SIZE/PMD_SIZE aligned.
>
> Fix this issue by rounding addr up to the next PUD_SIZE/PMD_SIZE
> boundary in these two cases.
>
> Cc: stable@vger.kernel.org # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for
> 32 bit")
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Cc: Joerg Roedel <jroedel@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>

This looks like it should do the trick!

Reviewed-by: Rik van Riel <riel@surriel.com>
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..5a67c3015f59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
 
 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			addr += PMD_SIZE;
 			continue;
 		}
pti_clone_pgtable() increases addr by PUD_SIZE for the pud_none(*pud)
case. This is not accurate because addr may not be PUD_SIZE aligned.

In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
of this issue, including the PMD for the irq entry table. For a memcache
like workload, this introduces about 4.5x more iTLB-load and about 2.5x
more iTLB-load-misses on a Skylake CPU.

This patch fixes this issue by adding PMD_SIZE to addr for the
pud_none() case.

Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/mm/pti.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)