[v2,4/6] x86: Add clear_page_nocache

Message ID	1344524583-1096-5-git-send-email-kirill.shutemov@linux.intel.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-sh-owner@vger.kernel.org> X-Original-To: patchwork-linux-sh@patchwork.kernel.org Delivered-To: patchwork-process-083081@patchwork2.kernel.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by patchwork2.kernel.org (Postfix) with ESMTP id 32CFFDFF7B for <patchwork-linux-sh@patchwork.kernel.org>; Thu, 9 Aug 2012 15:04:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030981Ab2HIPDi (ORCPT <rfc822;patchwork-linux-sh@patchwork.kernel.org>); Thu, 9 Aug 2012 11:03:38 -0400 Received: from mga01.intel.com ([192.55.52.88]:21172 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030978Ab2HIPDg (ORCPT <rfc822;linux-sh@vger.kernel.org>); Thu, 9 Aug 2012 11:03:36 -0400 Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by fmsmga101.fm.intel.com with ESMTP; 09 Aug 2012 08:03:17 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.77,740,1336374000"; d="scan'208";a="205532849" Received: from blue.fi.intel.com ([10.237.72.50]) by fmsmga002.fm.intel.com with ESMTP; 09 Aug 2012 08:03:11 -0700 Received: by blue.fi.intel.com (Postfix, from userid 1000) id B897BE0085; Thu, 9 Aug 2012 18:03:14 +0300 (EEST) From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> To: linux-mm@kvack.org Cc: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>, x86@kernel.org, Andi Kleen <ak@linux.intel.com>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Tim Chen <tim.c.chen@linux.intel.com>, Alex Shi <alex.shu@intel.com>, Jan Beulich <jbeulich@novell.com>, Robert Richter <robert.richter@amd.com>, Andy Lutomirski <luto@amacapital.net>, Andrew Morton <akpm@linux-foundation.org>, Andrea Arcangeli <aarcange@redhat.com>, Johannes Weiner <hannes@cmpxchg.org>, Hugh Dickins <hughd@google.com>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>, Mel Gorman <mgorman@suse.de>, linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-mips@linux-mips.org, linux-sh@vger.kernel.org, sparclinux@vger.kernel.org Subject: [PATCH v2 4/6] x86: Add clear_page_nocache Date: Thu, 9 Aug 2012 18:03:01 +0300 Message-Id: <1344524583-1096-5-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1344524583-1096-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1344524583-1096-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-sh-owner@vger.kernel.org Precedence: bulk List-ID: <linux-sh.vger.kernel.org> X-Mailing-List: linux-sh@vger.kernel.org

Kirill A . Shutemov Aug. 9, 2012, 3:03 p.m. UTC

From: Andi Kleen <ak@linux.intel.com>

Add a cache avoiding version of clear_page. Straight forward integer variant
of the existing 64bit clear_page, for both 32bit and 64bit.

Also add the necessary glue for highmem including a layer that non cache
coherent architectures that use the virtual address for flushing can
hook in. This is not needed on x86 of course.

If an architecture wants to provide cache avoiding version of clear_page
it should to define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/page.h          |    2 ++
 arch/x86/include/asm/string_32.h     |    5 +++++
 arch/x86/include/asm/string_64.h     |    5 +++++
 arch/x86/lib/Makefile                |    1 +
 arch/x86/lib/clear_page_nocache_32.S |   30 ++++++++++++++++++++++++++++++
 arch/x86/lib/clear_page_nocache_64.S |   29 +++++++++++++++++++++++++++++
 arch/x86/mm/fault.c                  |    7 +++++++
 7 files changed, 79 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_nocache_32.S
 create mode 100644 arch/x86/lib/clear_page_nocache_64.S

Jan Beulich Aug. 9, 2012, 3:22 p.m. UTC | #1

>>> On 09.08.12 at 17:03, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Add a cache avoiding version of clear_page. Straight forward integer variant
> of the existing 64bit clear_page, for both 32bit and 64bit.

While on 64-bit this is fine, I fail to see how you avoid using the
SSE2 instruction on non-SSE2 systems.

> Also add the necessary glue for highmem including a layer that non cache
> coherent architectures that use the virtual address for flushing can
> hook in. This is not needed on x86 of course.
> 
> If an architecture wants to provide cache avoiding version of clear_page
> it should to define ARCH_HAS_USER_NOCACHE to 1 and implement
> clear_page_nocache() and clear_user_highpage_nocache().
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/include/asm/page.h          |    2 ++
>  arch/x86/include/asm/string_32.h     |    5 +++++
>  arch/x86/include/asm/string_64.h     |    5 +++++
>  arch/x86/lib/Makefile                |    1 +
>  arch/x86/lib/clear_page_nocache_32.S |   30 ++++++++++++++++++++++++++++++
>  arch/x86/lib/clear_page_nocache_64.S |   29 +++++++++++++++++++++++++++++

Couldn't this more reasonably go into clear_page_{32,64}.S?

>  arch/x86/mm/fault.c                  |    7 +++++++
>  7 files changed, 79 insertions(+), 0 deletions(-)
>  create mode 100644 arch/x86/lib/clear_page_nocache_32.S
>  create mode 100644 arch/x86/lib/clear_page_nocache_64.S
>...
>--- /dev/null
>+++ b/arch/x86/lib/clear_page_nocache_32.S
>@@ -0,0 +1,30 @@
>+#include <linux/linkage.h>
>+#include <asm/dwarf2.h>
>+
>+/*
>+ * Zero a page avoiding the caches
>+ * rdi	page

Wrong comment.

>+ */
>+ENTRY(clear_page_nocache)
>+	CFI_STARTPROC
>+	mov    %eax,%edi

You need to pick a different register here (e.g. %edx), since
%edi has to be preserved by all functions called from C.

>+	xorl   %eax,%eax
>+	movl   $4096/64,%ecx
>+	.p2align 4
>+.Lloop:
>+	decl	%ecx
>+#define PUT(x) movnti %eax,x*8(%edi) ; movnti %eax,x*8+4(%edi)

Is doing twice as much unrolling as on 64-bit really worth it?

Jan

--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

H. Peter Anvin Aug. 9, 2012, 3:23 p.m. UTC | #2

On 08/09/2012 08:03 AM, Kirill A. Shutemov wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> Add a cache avoiding version of clear_page. Straight forward integer variant
> of the existing 64bit clear_page, for both 32bit and 64bit.
>
> Also add the necessary glue for highmem including a layer that non cache
> coherent architectures that use the virtual address for flushing can
> hook in. This is not needed on x86 of course.
>
> If an architecture wants to provide cache avoiding version of clear_page
> it should to define ARCH_HAS_USER_NOCACHE to 1 and implement
> clear_page_nocache() and clear_user_highpage_nocache().
>

Compile failure:

/home/hpa/kernel/tip.x86-mm/arch/x86/mm/fault.c: In function 
‘clear_user_highpage_nocache’:
/home/hpa/kernel/tip.x86-mm/arch/x86/mm/fault.c:1215:30: error: 
‘KM_USER0’ undeclared (first use in this function)
/home/hpa/kernel/tip.x86-mm/arch/x86/mm/fault.c:1215:30: note: each 
undeclared identifier is reported only once for each function it appears in
/home/hpa/kernel/tip.x86-mm/arch/x86/mm/fault.c:1215:2: error: too many 
arguments to function ‘kmap_atomic’
In file included from 
/home/hpa/kernel/tip.x86-mm/include/linux/pagemap.h:10:0,
                  from 
/home/hpa/kernel/tip.x86-mm/include/linux/mempolicy.h:70,
                  from 
/home/hpa/kernel/tip.x86-mm/include/linux/hugetlb.h:15,
                  from /home/hpa/kernel/tip.x86-mm/arch/x86/mm/fault.c:14:
/home/hpa/kernel/tip.x86-mm/include/linux/highmem.h:66:21: note: 
declared here
make[4]: *** [arch/x86/mm/fault.o] Error 1
make[3]: *** [arch/x86/mm] Error 2
make[2]: *** [arch/x86] Error 2
make[1]: *** [sub-make] Error 2
make[1]: Leaving directory `/home/hpa/kernel/tip.x86-mm'

This happens on *all* my test configurations, including both x86-64 and 
i386 allyesconfig.  I suspect your patchset base is stale.

	-hpa

Kirill A . Shutemov Aug. 13, 2012, 11:43 a.m. UTC | #3

On Thu, Aug 09, 2012 at 04:22:04PM +0100, Jan Beulich wrote:
> >>> On 09.08.12 at 17:03, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

...

> > ---
> >  arch/x86/include/asm/page.h          |    2 ++
> >  arch/x86/include/asm/string_32.h     |    5 +++++
> >  arch/x86/include/asm/string_64.h     |    5 +++++
> >  arch/x86/lib/Makefile                |    1 +
> >  arch/x86/lib/clear_page_nocache_32.S |   30 ++++++++++++++++++++++++++++++
> >  arch/x86/lib/clear_page_nocache_64.S |   29 +++++++++++++++++++++++++++++
> 
> Couldn't this more reasonably go into clear_page_{32,64}.S?

We don't have clear_page_32.S.

> >+	xorl   %eax,%eax
> >+	movl   $4096/64,%ecx
> >+	.p2align 4
> >+.Lloop:
> >+	decl	%ecx
> >+#define PUT(x) movnti %eax,x*8(%edi) ; movnti %eax,x*8+4(%edi)
> 
> Is doing twice as much unrolling as on 64-bit really worth it?

Moving 64 bytes per cycle is faster on Sandy Bridge, but slower on
Westmere. Any preference? ;)

Westmere:

 Performance counter stats for './test_unroll32' (20 runs):

      31498.420608 task-clock                #    0.998 CPUs utilized            ( +-  0.25% )
                40 context-switches          #    0.001 K/sec                    ( +-  1.40% )
                 0 CPU-migrations            #    0.000 K/sec                    ( +-100.00% )
                89 page-faults               #    0.003 K/sec                    ( +-  0.13% )
    74,728,231,935 cycles                    #    2.372 GHz                      ( +-  0.25% ) [83.34%]
    53,789,969,009 stalled-cycles-frontend   #   71.98% frontend cycles idle     ( +-  0.35% ) [83.33%]
    41,681,014,054 stalled-cycles-backend    #   55.78% backend  cycles idle     ( +-  0.43% ) [66.67%]
    37,992,733,278 instructions              #    0.51  insns per cycle
                                             #    1.42  stalled cycles per insn  ( +-  0.05% ) [83.33%]
     3,561,376,245 branches                  #  113.065 M/sec                    ( +-  0.05% ) [83.33%]
        27,182,795 branch-misses             #    0.76% of all branches          ( +-  0.06% ) [83.33%]

      31.558545812 seconds time elapsed                                          ( +-  0.25% )

 Performance counter stats for './test_unroll64' (20 runs):

      31564.753623 task-clock                #    0.998 CPUs utilized            ( +-  0.19% )
                39 context-switches          #    0.001 K/sec                    ( +-  0.40% )
                 0 CPU-migrations            #    0.000 K/sec
                90 page-faults               #    0.003 K/sec                    ( +-  0.12% )
    74,886,045,192 cycles                    #    2.372 GHz                      ( +-  0.19% ) [83.33%]
    57,477,323,995 stalled-cycles-frontend   #   76.75% frontend cycles idle     ( +-  0.26% ) [83.34%]
    44,548,142,150 stalled-cycles-backend    #   59.49% backend  cycles idle     ( +-  0.31% ) [66.67%]
    32,940,027,099 instructions              #    0.44  insns per cycle
                                             #    1.74  stalled cycles per insn  ( +-  0.05% ) [83.34%]
     1,884,944,093 branches                  #   59.717 M/sec                    ( +-  0.05% ) [83.32%]
         1,027,135 branch-misses             #    0.05% of all branches          ( +-  0.56% ) [83.34%]

      31.621001407 seconds time elapsed                                          ( +-  0.19% )

Sandy Bridge:

 Performance counter stats for './test_unroll32' (20 runs):

       8578.382891 task-clock                #    0.997 CPUs utilized            ( +-  0.08% )
                15 context-switches          #    0.000 M/sec                    ( +-  2.97% )
                 0 CPU-migrations            #    0.000 M/sec
                84 page-faults               #    0.000 M/sec                    ( +-  0.13% )
    29,154,476,597 cycles                    #    3.399 GHz                      ( +-  0.08% ) [83.33%]
    11,851,215,147 stalled-cycles-frontend   #   40.65% frontend cycles idle     ( +-  0.20% ) [83.33%]
     1,530,172,593 stalled-cycles-backend    #    5.25% backend  cycles idle     ( +-  1.44% ) [66.67%]
    37,915,778,094 instructions              #    1.30  insns per cycle
                                             #    0.31  stalled cycles per insn  ( +-  0.00% ) [83.34%]
     3,590,533,447 branches                  #  418.556 M/sec                    ( +-  0.01% ) [83.35%]
        26,500,765 branch-misses             #    0.74% of all branches          ( +-  0.01% ) [83.34%]

       8.604638449 seconds time elapsed                                          ( +-  0.08% )

 Performance counter stats for './test_unroll64' (20 runs):

       8463.789963 task-clock                #    0.997 CPUs utilized            ( +-  0.07% )
                14 context-switches          #    0.000 M/sec                    ( +-  1.70% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
                85 page-faults               #    0.000 M/sec                    ( +-  0.12% )
    28,763,328,688 cycles                    #    3.398 GHz                      ( +-  0.07% ) [83.32%]
    13,517,462,952 stalled-cycles-frontend   #   47.00% frontend cycles idle     ( +-  0.14% ) [83.33%]
     1,356,208,859 stalled-cycles-backend    #    4.72% backend  cycles idle     ( +-  1.42% ) [66.68%]
    32,885,492,141 instructions              #    1.14  insns per cycle
                                             #    0.41  stalled cycles per insn  ( +-  0.00% ) [83.34%]
     1,912,094,072 branches                  #  225.915 M/sec                    ( +-  0.02% ) [83.34%]
           305,896 branch-misses             #    0.02% of all branches          ( +-  1.05% ) [83.33%]

       8.488304839 seconds time elapsed                                          ( +-  0.07% )

$ cat test.c
#include <stdio.h>
#include <sys/mman.h>

#define SIZE 1024*1024*1024

void clear_page_nocache_sse2(void *page) __attribute__((regparm(1)));

int main(int argc, char** argv)
{
        char *p;
        unsigned long i, j;

        p = mmap(NULL, SIZE, PROT_WRITE|PROT_READ,
                        MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
        for(j = 0; j < 100; j++) {
                for(i = 0; i < SIZE; i += 4096) {
                        clear_page_nocache_sse2(p + i);
                }
        }

        return 0;
}
$ cat clear_page_nocache_unroll32.S
.globl clear_page_nocache_sse2
.align 4,0x90
clear_page_nocache_sse2:
.cfi_startproc
        mov    %eax,%edx
        xorl   %eax,%eax
        movl   $4096/32,%ecx
        .p2align 4
.Lloop_sse2:
        decl    %ecx
#define PUT(x) movnti %eax,x*4(%edx)
        PUT(0)
        PUT(1)
        PUT(2)
        PUT(3)
        PUT(4)
        PUT(5)
        PUT(6)
        PUT(7)
#undef PUT
        lea     32(%edx),%edx
        jnz     .Lloop_sse2
        nop
        ret
.cfi_endproc
.type clear_page_nocache_sse2, @function
.size clear_page_nocache_sse2, .-clear_page_nocache_sse2
$ cat clear_page_nocache_unroll64.S
.globl clear_page_nocache_sse2
.align 4,0x90
clear_page_nocache_sse2:
.cfi_startproc
        mov    %eax,%edx
        xorl   %eax,%eax
        movl   $4096/64,%ecx
        .p2align 4
.Lloop_sse2:
        decl    %ecx
#define PUT(x) movnti %eax,x*8(%edx) ; movnti %eax,x*8+4(%edx)
        PUT(0)
        PUT(1)
        PUT(2)
        PUT(3)
        PUT(4)
        PUT(5)
        PUT(6)
        PUT(7)
#undef PUT
        lea     64(%edx),%edx
        jnz     .Lloop_sse2
        nop
        ret
.cfi_endproc
.type clear_page_nocache_sse2, @function
.size clear_page_nocache_sse2, .-clear_page_nocache_sse2

Jan Beulich Aug. 13, 2012, 12:02 p.m. UTC | #4

>>> On 13.08.12 at 13:43, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> On Thu, Aug 09, 2012 at 04:22:04PM +0100, Jan Beulich wrote:
>> >>> On 09.08.12 at 17:03, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>  wrote:
> 
> ...
> 
>> > ---
>> >  arch/x86/include/asm/page.h          |    2 ++
>> >  arch/x86/include/asm/string_32.h     |    5 +++++
>> >  arch/x86/include/asm/string_64.h     |    5 +++++
>> >  arch/x86/lib/Makefile                |    1 +
>> >  arch/x86/lib/clear_page_nocache_32.S |   30 ++++++++++++++++++++++++++++++
>> >  arch/x86/lib/clear_page_nocache_64.S |   29 +++++++++++++++++++++++++++++
>> 
>> Couldn't this more reasonably go into clear_page_{32,64}.S?
> 
> We don't have clear_page_32.S.

Sure, but you're introducing a file anyway. Fold the new code into
the existing file for 64-bit, and create a new, similarly named one
for 32-bit.

>> >+	xorl   %eax,%eax
>> >+	movl   $4096/64,%ecx
>> >+	.p2align 4
>> >+.Lloop:
>> >+	decl	%ecx
>> >+#define PUT(x) movnti %eax,x*8(%edi) ; movnti %eax,x*8+4(%edi)
>> 
>> Is doing twice as much unrolling as on 64-bit really worth it?
> 
> Moving 64 bytes per cycle is faster on Sandy Bridge, but slower on
> Westmere. Any preference? ;)

If it's not a clear win, I'd favor the 8-stores-per-cycle variant,
matching x86-64.

Jan

--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andi Kleen Aug. 13, 2012, 4:27 p.m. UTC | #5

> Moving 64 bytes per cycle is faster on Sandy Bridge, but slower on
> Westmere. Any preference? ;)

You have to be careful with these benchmarks.

- You need to make sure the data is cache cold, cache hot is misleading.
- The numbers can change if you have multiple CPUs doing this in parallel.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Borislav Petkov Aug. 13, 2012, 5:04 p.m. UTC | #6

On Mon, Aug 13, 2012 at 02:43:34PM +0300, Kirill A. Shutemov wrote:
> $ cat test.c
> #include <stdio.h>
> #include <sys/mman.h>
> 
> #define SIZE 1024*1024*1024
> 
> void clear_page_nocache_sse2(void *page) __attribute__((regparm(1)));
> 
> int main(int argc, char** argv)
> {
>         char *p;
>         unsigned long i, j;
> 
>         p = mmap(NULL, SIZE, PROT_WRITE|PROT_READ,
>                         MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
>         for(j = 0; j < 100; j++) {
>                 for(i = 0; i < SIZE; i += 4096) {
>                         clear_page_nocache_sse2(p + i);
>                 }
>         }
> 
>         return 0;
> }
> $ cat clear_page_nocache_unroll32.S
> .globl clear_page_nocache_sse2
> .align 4,0x90
> clear_page_nocache_sse2:
> .cfi_startproc
>         mov    %eax,%edx
>         xorl   %eax,%eax
>         movl   $4096/32,%ecx
>         .p2align 4
> .Lloop_sse2:
>         decl    %ecx
> #define PUT(x) movnti %eax,x*4(%edx)
>         PUT(0)
>         PUT(1)
>         PUT(2)
>         PUT(3)
>         PUT(4)
>         PUT(5)
>         PUT(6)
>         PUT(7)
> #undef PUT
>         lea     32(%edx),%edx
>         jnz     .Lloop_sse2
>         nop
>         ret
> .cfi_endproc
> .type clear_page_nocache_sse2, @function
> .size clear_page_nocache_sse2, .-clear_page_nocache_sse2
> $ cat clear_page_nocache_unroll64.S
> .globl clear_page_nocache_sse2
> .align 4,0x90
> clear_page_nocache_sse2:
> .cfi_startproc
>         mov    %eax,%edx

This must still be the 32-bit version becaue it segfaults here. Here's
why:

mmap above gives a ptr which, on 64-bit, is larger than 32-bit, i.e. it
looks like 0x7fffxxxxx000, i.e. starting from top of userspace.

Now, the mov above truncates that ptr and the thing segfaults.

Doing s/edx/rdx/g fixes it though.

Thanks.

Kirill A. Shutemov Aug. 13, 2012, 7:07 p.m. UTC | #7

On Mon, Aug 13, 2012 at 07:04:02PM +0200, Borislav Petkov wrote:
> On Mon, Aug 13, 2012 at 02:43:34PM +0300, Kirill A. Shutemov wrote:
> > $ cat test.c
> > #include <stdio.h>
> > #include <sys/mman.h>
> > 
> > #define SIZE 1024*1024*1024
> > 
> > void clear_page_nocache_sse2(void *page) __attribute__((regparm(1)));
> > 
> > int main(int argc, char** argv)
> > {
> >         char *p;
> >         unsigned long i, j;
> > 
> >         p = mmap(NULL, SIZE, PROT_WRITE|PROT_READ,
> >                         MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
> >         for(j = 0; j < 100; j++) {
> >                 for(i = 0; i < SIZE; i += 4096) {
> >                         clear_page_nocache_sse2(p + i);
> >                 }
> >         }
> > 
> >         return 0;
> > }
> > $ cat clear_page_nocache_unroll32.S
> > .globl clear_page_nocache_sse2
> > .align 4,0x90
> > clear_page_nocache_sse2:
> > .cfi_startproc
> >         mov    %eax,%edx
> >         xorl   %eax,%eax
> >         movl   $4096/32,%ecx
> >         .p2align 4
> > .Lloop_sse2:
> >         decl    %ecx
> > #define PUT(x) movnti %eax,x*4(%edx)
> >         PUT(0)
> >         PUT(1)
> >         PUT(2)
> >         PUT(3)
> >         PUT(4)
> >         PUT(5)
> >         PUT(6)
> >         PUT(7)
> > #undef PUT
> >         lea     32(%edx),%edx
> >         jnz     .Lloop_sse2
> >         nop
> >         ret
> > .cfi_endproc
> > .type clear_page_nocache_sse2, @function
> > .size clear_page_nocache_sse2, .-clear_page_nocache_sse2
> > $ cat clear_page_nocache_unroll64.S
> > .globl clear_page_nocache_sse2
> > .align 4,0x90
> > clear_page_nocache_sse2:
> > .cfi_startproc
> >         mov    %eax,%edx
> 
> This must still be the 32-bit version becaue it segfaults here.

Yes, it's test for 32-bit version.

[v2,4/6] x86: Add clear_page_nocache

Commit Message

Comments

Patch