diff mbox series

[v2] arm64: mm: force write fault for atomic RMW instructions

Message ID 20240520165636.802268-1-yang@os.amperecomputing.com (mailing list archive)
State New, archived
Headers show
Series [v2] arm64: mm: force write fault for atomic RMW instructions | expand

Commit Message

Yang Shi May 20, 2024, 4:56 p.m. UTC
The atomic RMW instructions, for example, ldadd, actually does load +
add + store in one instruction, it will trigger two page faults per the
ARM64 architecture spec, the first fault is a read fault, the second
fault is a write fault.

Some applications use atomic RMW instructions to populate memory, for
example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
at launch time) between v18 and v22 in order to permit use of memory
concurrently with pretouch.

But the double page fault has some problems:

1. Noticeable TLB overhead.  The kernel actually installs zero page with
   readonly PTE for the read fault.  The write fault will trigger a
   write-protection fault (CoW).  The CoW will allocate a new page and
   make the PTE point to the new page, this needs TLB invalidations.  The
   tlb invalidation and the mandatory memory barriers may incur
   significant overhead, particularly on the machines with many cores.

2. Break up huge pages.  If THP is on the read fault will install huge
   zero pages.  The later CoW will break up the huge page and allocate
   base pages instead of huge page.  The applications have to rely on
   khugepaged (kernel thread) to collapse huge pages asynchronously.
   This also incurs noticeable performance penalty.

3. 512x page faults with huge page.  Due to #2, the applications have to
   have page faults for every 4K area for the write, this makes the speed
   up by using huge page actually gone.

So it sounds pointless to have two page faults since we know the memory
will be definitely written very soon.  Forcing write fault for atomic RMW
instruction makes some sense and it can solve the aforementioned problems:

Firstly, it just allocates zero'ed page, no tlb invalidation and memory
barriers anymore.
Secondly, it can populate writable huge pages in the first place and
don't break them up.  Just one page fault is needed for 2M area instrad
of 512 faults and also save cpu time by not using khugepaged.

A simple micro benchmark which populates 1G memory shows the number of
page faults is reduced by half and the time spent by system is reduced
by 60% on a VM running on Ampere Altra platform.

And the benchmark for anonymous read fault on 1G memory, file read fault
on 1G file (cold page cache and warm page cache) don't show noticeable
regression.

Some other architectures also have code inspection in page fault path,
for example, SPARC and x86.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/include/asm/insn.h |  8 ++++++++
 arch/arm64/mm/fault.c         | 38 +++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

v2: 1. Made commit log more precise per Anshuman and Catalin
    2. Made pagefault_disable/enable window narrower per Anshuman
    3. Covered CAS and CASP variants per Catalin
    4. Put instruction fetching and decoding into a helper function and
       take into account endianess per Catalin
    5. Don't fetch and decode insn for 32 bit mode (compat) per Catalin
    6. More performance tests and exec-only test per Anshuman and Catalin

Comments

Catalin Marinas May 23, 2024, 5:07 p.m. UTC | #1
On Mon, May 20, 2024 at 09:56:36AM -0700, Yang Shi wrote:
> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index db1aeacd4cd9..1cc73664fc55 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
>   * "-" means "don't care"
>   */
>  __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
> +__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)

While this class includes all atomics that currently require write
permission, there's some unallocated space in this range and we don't
know what future architecture versions may introduce. Unfortunately we
need to check each individual atomic op in this class (not sure what the
overhead will be).
Christoph Lameter (Ampere) May 23, 2024, 6:09 p.m. UTC | #2
On Thu, 23 May 2024, Catalin Marinas wrote:

> On Mon, May 20, 2024 at 09:56:36AM -0700, Yang Shi wrote:
>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>> index db1aeacd4cd9..1cc73664fc55 100644
>> --- a/arch/arm64/include/asm/insn.h
>> +++ b/arch/arm64/include/asm/insn.h
>> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
>>   * "-" means "don't care"
>>   */
>>  __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
>> +__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
>
> While this class includes all atomics that currently require write
> permission, there's some unallocated space in this range and we don't
> know what future architecture versions may introduce. Unfortunately we
> need to check each individual atomic op in this class (not sure what the
> overhead will be).

Can you tell us which bits or pattern is not allocated? Maybe we can 
exclude that from the pattern.
Catalin Marinas May 23, 2024, 7 p.m. UTC | #3
On Thu, May 23, 2024 at 11:09:11AM -0700, Christoph Lameter (Ampere) wrote:
> On Thu, 23 May 2024, Catalin Marinas wrote:
> 
> > On Mon, May 20, 2024 at 09:56:36AM -0700, Yang Shi wrote:
> > > diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> > > index db1aeacd4cd9..1cc73664fc55 100644
> > > --- a/arch/arm64/include/asm/insn.h
> > > +++ b/arch/arm64/include/asm/insn.h
> > > @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
> > >   * "-" means "don't care"
> > >   */
> > >  __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
> > > +__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
> > 
> > While this class includes all atomics that currently require write
> > permission, there's some unallocated space in this range and we don't
> > know what future architecture versions may introduce. Unfortunately we
> > need to check each individual atomic op in this class (not sure what the
> > overhead will be).
> 
> Can you tell us which bits or pattern is not allocated? Maybe we can exclude
> that from the pattern.

Yes, it may be easier to exclude those patterns. See the Arm ARM K.a
section C4.1.94.29 (page 791).
Christoph Lameter (Ampere) May 23, 2024, 7:43 p.m. UTC | #4
On Thu, 23 May 2024, Catalin Marinas wrote:

>>> While this class includes all atomics that currently require write
>>> permission, there's some unallocated space in this range and we don't
>>> know what future architecture versions may introduce. Unfortunately we
>>> need to check each individual atomic op in this class (not sure what the
>>> overhead will be).
>>
>> Can you tell us which bits or pattern is not allocated? Maybe we can exclude
>> that from the pattern.
>
> Yes, it may be easier to exclude those patterns. See the Arm ARM K.a
> section C4.1.94.29 (page 791).

Hmmm. We could consult an exception table once the pattern matches to 
reduce the overhead.

However, the harm done I think is acceptable even if we leave things as 
is. In the worst case we create unnecesssary write fault processing for an 
"atomic op" that does not need write access. Also: Why would it need to be 
atomic if it does not write???

It is more likely that new atomic ops are added that require write 
permissions. Those will then just work. Otherwise we would need to 
maintain an exception table of unallocated instructions that would then 
have to shrink depending on new atomics added.

The ultimate solution would be to change the spec so that arm processors 
can skip useless read faults.
Catalin Marinas May 23, 2024, 9:34 p.m. UTC | #5
On Thu, May 23, 2024 at 12:43:34PM -0700, Christoph Lameter (Ampere) wrote:
> On Thu, 23 May 2024, Catalin Marinas wrote:
> > > > While this class includes all atomics that currently require write
> > > > permission, there's some unallocated space in this range and we don't
> > > > know what future architecture versions may introduce. Unfortunately we
> > > > need to check each individual atomic op in this class (not sure what the
> > > > overhead will be).
> > > 
> > > Can you tell us which bits or pattern is not allocated? Maybe we can exclude
> > > that from the pattern.
> > 
> > Yes, it may be easier to exclude those patterns. See the Arm ARM K.a
> > section C4.1.94.29 (page 791).
> 
> Hmmm. We could consult an exception table once the pattern matches to reduce
> the overhead.

Yeah, check the atomic class first and then go into the finer-grained
details. I think this would reduce the overhead for non-atomic
instructions.

> However, the harm done I think is acceptable even if we leave things as is.
> In the worst case we create unnecesssary write fault processing for an
> "atomic op" that does not need write access. Also: Why would it need to be
> atomic if it does not write???

I'm thinking of some conditional instruction that states no write if
condition fails. But it could be even worse if the architects decide to
reuse that unallocated space for some instructions that have nothing to
do with the atomic accesses.

It's something we need to clarify with them but I'm about to go on
holiday for a week, so I won't be able to check.

> The ultimate solution would be to change the spec so that arm processors can
> skip useless read faults.

I raised this already, waiting for feedback from the architects.
Yang Shi May 23, 2024, 10:13 p.m. UTC | #6
On 5/23/24 2:34 PM, Catalin Marinas wrote:
> On Thu, May 23, 2024 at 12:43:34PM -0700, Christoph Lameter (Ampere) wrote:
>> On Thu, 23 May 2024, Catalin Marinas wrote:
>>>>> While this class includes all atomics that currently require write
>>>>> permission, there's some unallocated space in this range and we don't
>>>>> know what future architecture versions may introduce. Unfortunately we
>>>>> need to check each individual atomic op in this class (not sure what the
>>>>> overhead will be).
>>>> Can you tell us which bits or pattern is not allocated? Maybe we can exclude
>>>> that from the pattern.
>>> Yes, it may be easier to exclude those patterns. See the Arm ARM K.a
>>> section C4.1.94.29 (page 791).
>> Hmmm. We could consult an exception table once the pattern matches to reduce
>> the overhead.
> Yeah, check the atomic class first and then go into the finer-grained
> details. I think this would reduce the overhead for non-atomic
> instructions.

If I read the instruction encoding correctly, the unallocated 
instructions are decided by the below fields:

   - size
   - VAR
   - o3
   - opc

To exclude them I think we can do something like:

if atomic instructions {
     if V == 1
         return false;
     if o3 opc == 111x
         return false;
     switch VAR {
         000
             check o3 and opc
         001
             check 03 and opc
         010
             check o3 and opc
         011
             check o3 and opc
         default
             if size != 11
                 check o3 and opc
     }
}

So it may take 4 + the possible unallocated combos of o3 and opc 
branches for the worst case. I saw 5 different combos for o3 and opc, so 
9 branches for worst cases.

>
>> However, the harm done I think is acceptable even if we leave things as is.
>> In the worst case we create unnecesssary write fault processing for an
>> "atomic op" that does not need write access. Also: Why would it need to be
>> atomic if it does not write???
> I'm thinking of some conditional instruction that states no write if
> condition fails. But it could be even worse if the architects decide to
> reuse that unallocated space for some instructions that have nothing to
> do with the atomic accesses.

Even though the condition fails, forcing write fault still seems fine 
IIUC. I'm supposed the read fault will happen regardless of the 
condition. Then a page with all 0 content is installed. This is 
guaranteed. We just end up having write permission instead of read-only 
permission. We will also be in this state transiently with current 
supported atomic instructions.

But if they will be allocated to non-atomic instructions, we have to do 
fine-grained decoding, but it may be easier since we can just filter out 
those non-atomic instructions? Anyway it depends on how they will be 
used. Hopefully this won't happen.

>
> It's something we need to clarify with them but I'm about to go on
> holiday for a week, so I won't be able to check.

Have a good holiday.

>
>> The ultimate solution would be to change the spec so that arm processors can
>> skip useless read faults.
> I raised this already, waiting for feedback from the architects.

Thank you so much.

>
Catalin Marinas June 3, 2024, 4:06 p.m. UTC | #7
On Thu, May 23, 2024 at 03:13:23PM -0700, Yang Shi wrote:
> On 5/23/24 2:34 PM, Catalin Marinas wrote:
> > On Thu, May 23, 2024 at 12:43:34PM -0700, Christoph Lameter (Ampere) wrote:
> > > On Thu, 23 May 2024, Catalin Marinas wrote:
> > > > > > While this class includes all atomics that currently require write
> > > > > > permission, there's some unallocated space in this range and we don't
> > > > > > know what future architecture versions may introduce. Unfortunately we
> > > > > > need to check each individual atomic op in this class (not sure what the
> > > > > > overhead will be).
> > > > > 
> > > > > Can you tell us which bits or pattern is not allocated? Maybe we can exclude
> > > > > that from the pattern.
> > > > 
> > > > Yes, it may be easier to exclude those patterns. See the Arm ARM K.a
> > > > section C4.1.94.29 (page 791).
> > > 
> > > Hmmm. We could consult an exception table once the pattern matches to reduce
> > > the overhead.
> > 
> > Yeah, check the atomic class first and then go into the finer-grained
> > details. I think this would reduce the overhead for non-atomic
> > instructions.
> 
> If I read the instruction encoding correctly, the unallocated instructions
> are decided by the below fields:
> 
>   - size
>   - VAR
>   - o3
>   - opc
> 
> To exclude them I think we can do something like:
> 
> if atomic instructions {
>     if V == 1
>         return false;
>     if o3 opc == 111x
>         return false;
>     switch VAR {
>         000
>             check o3 and opc
>         001
>             check 03 and opc
>         010
>             check o3 and opc
>         011
>             check o3 and opc
>         default
>             if size != 11
>                 check o3 and opc
>     }
> }
> 
> So it may take 4 + the possible unallocated combos of o3 and opc branches
> for the worst case. I saw 5 different combos for o3 and opc, so 9 branches
> for worst cases.

Or we have a sorted table of exclusions and do a binary search. Not sure
which one is faster.

> But if they will be allocated to non-atomic instructions, we have to do
> fine-grained decoding, but it may be easier since we can just filter out
> those non-atomic instructions? Anyway it depends on how they will be used.
> Hopefully this won't happen.

Actually, the atomics table has LD64B and LDAPR already which are load
instructions, no write permission needed. So we need to exclude these
and all the unallocated space in this range.
Yang Shi June 3, 2024, 8:34 p.m. UTC | #8
On 6/3/24 9:06 AM, Catalin Marinas wrote:
> On Thu, May 23, 2024 at 03:13:23PM -0700, Yang Shi wrote:
>> On 5/23/24 2:34 PM, Catalin Marinas wrote:
>>> On Thu, May 23, 2024 at 12:43:34PM -0700, Christoph Lameter (Ampere) wrote:
>>>> On Thu, 23 May 2024, Catalin Marinas wrote:
>>>>>>> While this class includes all atomics that currently require write
>>>>>>> permission, there's some unallocated space in this range and we don't
>>>>>>> know what future architecture versions may introduce. Unfortunately we
>>>>>>> need to check each individual atomic op in this class (not sure what the
>>>>>>> overhead will be).
>>>>>> Can you tell us which bits or pattern is not allocated? Maybe we can exclude
>>>>>> that from the pattern.
>>>>> Yes, it may be easier to exclude those patterns. See the Arm ARM K.a
>>>>> section C4.1.94.29 (page 791).
>>>> Hmmm. We could consult an exception table once the pattern matches to reduce
>>>> the overhead.
>>> Yeah, check the atomic class first and then go into the finer-grained
>>> details. I think this would reduce the overhead for non-atomic
>>> instructions.
>> If I read the instruction encoding correctly, the unallocated instructions
>> are decided by the below fields:
>>
>>    - size
>>    - VAR
>>    - o3
>>    - opc
>>
>> To exclude them I think we can do something like:
>>
>> if atomic instructions {
>>      if V == 1
>>          return false;
>>      if o3 opc == 111x
>>          return false;
>>      switch VAR {
>>          000
>>              check o3 and opc
>>          001
>>              check 03 and opc
>>          010
>>              check o3 and opc
>>          011
>>              check o3 and opc
>>          default
>>              if size != 11
>>                  check o3 and opc
>>      }
>> }
>>
>> So it may take 4 + the possible unallocated combos of o3 and opc branches
>> for the worst case. I saw 5 different combos for o3 and opc, so 9 branches
>> for worst cases.
> Or we have a sorted table of exclusions and do a binary search. Not sure
> which one is faster.
>
>> But if they will be allocated to non-atomic instructions, we have to do
>> fine-grained decoding, but it may be easier since we can just filter out
>> those non-atomic instructions? Anyway it depends on how they will be used.
>> Hopefully this won't happen.
> Actually, the atomics table has LD64B and LDAPR already which are load
> instructions, no write permission needed. So we need to exclude these
> and all the unallocated space in this range.

OK. Excluding LD64B and LDAPR actually makes the check much simpler if 
we return true for supported instructions instead of checking 
unallocated instructions. It looks like:

((val & 0x3f207c00) == 0x38200000) |
((val & 0x3f208c00) == 0x38200000) |
((val & 0x7fe06c00) == 0x78202000) |
((val & 0xbf204c00) == 0x38200000)

Thanks D Scott help figure this out.

>
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index db1aeacd4cd9..1cc73664fc55 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -319,6 +319,7 @@  static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
  * "-" means "don't care"
  */
 __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
+__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
 
 __AARCH64_INSN_FUNCS(adr,	0x9F000000, 0x10000000)
 __AARCH64_INSN_FUNCS(adrp,	0x9F000000, 0x90000000)
@@ -339,6 +340,7 @@  __AARCH64_INSN_FUNCS(ldeor,	0x3F20FC00, 0x38202000)
 __AARCH64_INSN_FUNCS(ldset,	0x3F20FC00, 0x38203000)
 __AARCH64_INSN_FUNCS(swp,	0x3F20FC00, 0x38208000)
 __AARCH64_INSN_FUNCS(cas,	0x3FA07C00, 0x08A07C00)
+__AARCH64_INSN_FUNCS(casp,	0xBFA07C00, 0x08207C00)
 __AARCH64_INSN_FUNCS(ldr_reg,	0x3FE0EC00, 0x38606800)
 __AARCH64_INSN_FUNCS(signed_ldr_reg, 0X3FE0FC00, 0x38A0E800)
 __AARCH64_INSN_FUNCS(ldr_imm,	0x3FC00000, 0x39400000)
@@ -543,6 +545,12 @@  static __always_inline bool aarch64_insn_uses_literal(u32 insn)
 	       aarch64_insn_is_prfm_lit(insn);
 }
 
+static __always_inline bool aarch64_insn_is_class_cas(u32 insn)
+{
+	return aarch64_insn_is_cas(insn) ||
+	       aarch64_insn_is_casp(insn);
+}
+
 enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
 u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
 u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 8251e2fea9c7..73f954fcb8c7 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -519,6 +519,30 @@  static bool is_write_abort(unsigned long esr)
 	return (esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM);
 }
 
+static bool is_el0_atomic_instr(struct pt_regs *regs)
+{
+	u32 insn;
+	__le32 insn_le;
+	unsigned long pc = instruction_pointer(regs);
+
+	if (!user_mode(regs) || compat_user_mode(regs))
+		return false;
+
+	pagefault_disable();
+	if (get_user(insn_le, (__le32 __user *)pc)) {
+		pagefault_enable();
+		return false;
+	}
+	pagefault_enable();
+
+	insn = le32_to_cpu(insn_le);
+	if (aarch64_insn_is_class_atomic(insn) ||
+	    aarch64_insn_is_class_cas(insn))
+		return true;
+
+	return false;
+}
+
 static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 				   struct pt_regs *regs)
 {
@@ -529,6 +553,7 @@  static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
 	unsigned long addr = untagged_addr(far);
 	struct vm_area_struct *vma;
+	bool force_write = false;
 
 	if (kprobe_page_fault(regs, esr))
 		return 0;
@@ -557,6 +582,11 @@  static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 		/* It was write fault */
 		vm_flags = VM_WRITE;
 		mm_flags |= FAULT_FLAG_WRITE;
+	} else if (is_el0_atomic_instr(regs)) {
+		/* Force write fault */
+		vm_flags = VM_WRITE;
+		mm_flags |= FAULT_FLAG_WRITE;
+		force_write = true;
 	} else {
 		/* It was read fault */
 		vm_flags = VM_READ;
@@ -586,6 +616,14 @@  static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	if (!vma)
 		goto lock_mmap;
 
+	/* vma flags don't allow write, undo force write */
+	if (force_write && !(vma->vm_flags & VM_WRITE)) {
+		vm_flags |= VM_READ;
+		if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
+			vm_flags |= VM_EXEC;
+		mm_flags &= ~FAULT_FLAG_WRITE;
+	}
+
 	if (!(vma->vm_flags & vm_flags)) {
 		vma_end_read(vma);
 		goto lock_mmap;