Message ID: CAN_5kQBszi=hV1RVjyKO6gOhOuymGjsMwLk6ORaWpkaL-4USxA@mail.gmail.com (mailing list archive)
State: New, archived
On Wed, Jul 06, 2011 at 04:14:57AM +0100, heechul Yun wrote:
> I found a few other places which, I believe, are not necessary for Cortex-A9.
>
> diff --git a/arch/arm/mm/copypage-v6.c b/arch/arm/mm/copypage-v6.c
> index bdba6c6..6d5a847 100644
> --- a/arch/arm/mm/copypage-v6.c
> +++ b/arch/arm/mm/copypage-v6.c
> @@ -41,7 +41,9 @@ static void v6_copy_user_highpage_nonaliasing(struct page *to,
> 	kfrom = kmap_atomic(from, KM_USER0);
> 	kto = kmap_atomic(to, KM_USER1);
> 	copy_page(kto, kfrom);
> +#ifndef CONFIG_CPU_CACHE_V7
> 	__cpuc_flush_dcache_area(kto, PAGE_SIZE);
> +#endif
> 	kunmap_atomic(kto, KM_USER1);
> 	kunmap_atomic(kfrom, KM_USER0);
> }
>
> On handling a COW page fault, the above function is called to copy the
> page content of the parent to a newly allocated page frame for the
> child. Again, since the D-cache of the A9 is PIPT, we do not need to
> flush the page as on x86. This modification improves lmbench
> (fork/exec/shell) performance by 4-6%.

See commit 115b2247 introducing this. We indeed have a PIPT-like cache
on the A9, but it is a Harvard architecture with separate I- and
D-caches. It happened in the past that we got a COW on a text page and
the I- and D-caches became incoherent. Since then, the dynamic linker
has been fixed and no longer causes this. We could add a check for
VM_EXEC in vma->vm_flags.

But I wonder whether we still need this flush after commit c0177800,
where we assume that a new page cache page has a dirty D-cache (and we
later flush the caches via set_pte_at).

> I think the above two patches work at least for Cortex-A9, although I
> am not sure the use of CONFIG_CPU_CACHE_V7 is appropriate.

We need to check the ID_MMFR1 register, as there are other ARMv7 cores
that cannot do page table walks in the L1 cache.
On Wed, Jul 06, 2011 at 09:56:56AM +0100, Catalin Marinas wrote:
> On Wed, Jul 06, 2011 at 04:14:57AM +0100, heechul Yun wrote:
> > I found a few other places which, I believe, are not necessary for Cortex-A9.
> >
> > diff --git a/arch/arm/mm/copypage-v6.c b/arch/arm/mm/copypage-v6.c
> > index bdba6c6..6d5a847 100644
> > --- a/arch/arm/mm/copypage-v6.c
> > +++ b/arch/arm/mm/copypage-v6.c
> > @@ -41,7 +41,9 @@ static void v6_copy_user_highpage_nonaliasing(struct page *to,
> > 	kfrom = kmap_atomic(from, KM_USER0);
> > 	kto = kmap_atomic(to, KM_USER1);
> > 	copy_page(kto, kfrom);
> > +#ifndef CONFIG_CPU_CACHE_V7
> > 	__cpuc_flush_dcache_area(kto, PAGE_SIZE);
> > +#endif
> > 	kunmap_atomic(kto, KM_USER1);
> > 	kunmap_atomic(kfrom, KM_USER0);
> > }
> >
> > On handling a COW page fault, the above function is called to copy the
> > page content of the parent to a newly allocated page frame for the
> > child. Again, since the D-cache of the A9 is PIPT, we do not need to
> > flush the page as on x86. This modification improves lmbench
> > (fork/exec/shell) performance by 4-6%.
>
> See commit 115b2247 introducing this. We indeed have a PIPT-like cache
> on the A9, but it is a Harvard architecture with separate I- and
> D-caches. It happened in the past that we got a COW on a text page and
> the I- and D-caches became incoherent. Since then, the dynamic linker
> has been fixed and no longer causes this. We could add a check for
> VM_EXEC in vma->vm_flags.
>
> But I wonder whether we still need this flush after commit c0177800,
> where we assume that a new page cache page has a dirty D-cache (and we
> later flush the caches via set_pte_at).

I don't think we need that flush there after c0177800 either. I/D
coherency implies that pte_exec() is set, which will get us through to
the check of PG_arch_1 in __sync_icache_dcache(), where we'll call
__flush_dcache_page() for this page.

We don't need this flush anymore, so let's simply kill it outright.

Heechul (sorry, is that the correct way of addressing you?), could you
please submit a patch removing the __cpuc_flush_dcache_area() call from
v6_copy_user_highpage_nonaliasing() entirely? Thanks.
> We don't need this flush anymore, so let's simply kill it outright.
>
> Heechul (sorry, is that the correct way of addressing you?), could
> you please submit a patch removing the __cpuc_flush_dcache_area()
> call from v6_copy_user_highpage_nonaliasing() entirely?

I sent the patch.

Thanks,
Heechul
diff --git a/arch/arm/mm/copypage-v6.c b/arch/arm/mm/copypage-v6.c
index bdba6c6..6d5a847 100644
--- a/arch/arm/mm/copypage-v6.c
+++ b/arch/arm/mm/copypage-v6.c
@@ -41,7 +41,9 @@ static void v6_copy_user_highpage_nonaliasing(struct page *to,
 	kfrom = kmap_atomic(from, KM_USER0);
 	kto = kmap_atomic(to, KM_USER1);
 	copy_page(kto, kfrom);
+#ifndef CONFIG_CPU_CACHE_V7
 	__cpuc_flush_dcache_area(kto, PAGE_SIZE);
+#endif
 	kunmap_atomic(kto, KM_USER1);
 	kunmap_atomic(kfrom, KM_USER0);
 }

On handling a COW page fault, the above function is called to copy the
page content of the parent to a newly allocated page frame for the
child. Again, since the D-cache of the A9 is PIPT, we do not need to
flush the page as on x86. This modification improves lmbench
(fork/exec/shell) performance by 4-6%.

diff --git a/arch/arm/include/asm/pgalloc.h b/arch/arm/include/asm/pgalloc.h
index b12cc98..bff9858 100644
--- a/arch/arm/include/asm/pgalloc.h
+++ b/arch/arm/include/asm/pgalloc.h
@@ -61,7 +61,9 @@ pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
 	pte = (pte_t *)__get_free_page(PGALLOC_GFP);
 	if (pte) {
+#if !CONFIG_CPU_CACHE_V7
 		clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE);
+#endif
 		pte += PTRS_PER_PTE;
 	}
@@ -81,7 +83,9 @@ pte_alloc_one(struct mm_struct *mm, unsigned long addr)
 	if (pte) {
 		if (!PageHighMem(pte)) {
 			void *page = page_address(pte);
+#if !CONFIG_CPU_CACHE_V7
 			clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE);
+#endif
 		}
 		pgtable_page_ctor(pte);
 	}

diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c
index be5f58e..343df1b 100644
--- a/arch/arm/mm/pgd.c
+++ b/arch/arm/mm/pgd.c
@@ -41,8 +41,9 @@ pgd_t *get_pgd_slow(struct mm_struct *mm)
 	memcpy(new_pgd + FIRST_KERNEL_PGD_NR, init_pgd + FIRST_KERNEL_PGD_NR,
 	       (PTRS_PER_PGD - FIRST_KERNEL_PGD_NR) * sizeof(pgd_t));
+#if !CONFIG_CPU_CACHE_V7
 	clean_dcache_area(new_pgd, PTRS_PER_PGD * sizeof(pgd_t));
-
+#endif
 	if (!vectors_high()) {
 		/*
		 * On ARM, first page must always be allocated since it