
[RFC,v9,00/13] Add support for eXclusive Page Frame Ownership

Message ID cover.1554248001.git.khalid.aziz@oracle.com

Message

Khalid Aziz April 3, 2019, 5:34 p.m. UTC
This is another update to the work Juerg, Tycho and Julian have
done on XPFO. After the last round of updates, we were seeing very
significant performance penalties when stale TLB entries were
flushed actively after an XPFO TLB update. The benchmark used to
measure performance is a kernel build using parallel make. To get
full protection from ret2dir attacks, we must flush stale TLB
entries. The performance penalty from flushing stale TLB entries
goes up as the number of cores goes up. On a desktop class machine
with only 4 cores, enabling TLB flush for stale entries causes
system time for "make -j4" to go up by a factor of 2.61x, but on a
larger machine with 96 cores, system time with "make -j60" goes up
by a factor of 26.37x! I have been working on reducing this
performance penalty.

I implemented two solutions to reduce the performance penalty, and
they have had a large impact. The XPFO code flushes the TLB every
time a page is allocated to userspace. It does so by sending IPIs
to all processors to flush the TLB. Back-to-back allocations of
pages to userspace on multiple processors result in a storm of
IPIs. Each of these incoming IPIs is handled by a processor by
flushing its TLB. To reduce this IPI storm, I have added a per-CPU
flag that can be set to tell a processor to flush its TLB. A
processor checks this flag on every context switch. If the flag is
set, it flushes its TLB and clears the flag. This allows multiple
TLB flush requests to a single CPU to be combined into a single
flush. Unlike the previous version of this patch, a kernel TLB
entry for a page that has been allocated to userspace is now
flushed on all processors, but a processor could hold a stale
kernel TLB entry that was removed on another processor until its
next context switch. A local userspace page allocation by the
currently running process could force the TLB flush earlier for
such entries.
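
As a rough illustration of this deferred-flush scheme (a simplified
sketch, not the actual patch code; xpfo_flush_pending,
xpfo_defer_remote_tlb_flush() and xpfo_check_deferred_flush() are
placeholder names):

#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/cpumask.h>
#include <asm/tlbflush.h>

static DEFINE_PER_CPU(bool, xpfo_flush_pending);

/* Ask every other CPU to flush lazily instead of sending IPIs now. */
static void xpfo_defer_remote_tlb_flush(void)
{
        int this_cpu = get_cpu();       /* disable preemption */
        int cpu;

        for_each_online_cpu(cpu)
                if (cpu != this_cpu)
                        per_cpu(xpfo_flush_pending, cpu) = true;

        __flush_tlb_all();              /* the local TLB is flushed right away */
        put_cpu();
}

/* Hooked into the context-switch path on each CPU. */
void xpfo_check_deferred_flush(void)
{
        if (this_cpu_read(xpfo_flush_pending)) {
                this_cpu_write(xpfo_flush_pending, false);
                __flush_tlb_all();
        }
}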

The other solution reduces the number of TLB flushes required by
performing the TLB flush for multiple pages at one time, when pages
are refilled on the per-cpu freelist. If the pages being added to
the per-cpu freelist are marked for userspace allocation, TLB
entries for these pages can be flushed upfront and the pages tagged
as currently unmapped. When any such page is allocated to
userspace, there is no longer a need to perform a TLB flush at that
time. This batching of TLB flushes reduces the performance impact
further. Similarly, when these user pages are freed by userspace
and added back to the per-cpu free list, they are left unmapped and
tagged as such. This further optimization reduced the performance
impact from 1.32x to 1.28x on the 96-core server and from 1.31x to
1.27x on the 4-core desktop.
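
A simplified sketch of this batching, again with placeholder names
(set_kpte() is the helper this series adds for updating kernel page
table entries; SetPageXpfoUnmapped() stands in for however the
"currently unmapped" tag ends up being recorded):

#include <linux/mm.h>
#include <linux/list.h>
#include <asm/tlbflush.h>

/* Unmap a batch of user-bound pages as they land on the per-cpu list. */
static void xpfo_unmap_refill_batch(struct list_head *batch)
{
        struct page *page;
        unsigned long start = ULONG_MAX, end = 0;

        list_for_each_entry(page, batch, lru) {
                unsigned long addr = (unsigned long)page_address(page);

                /* Drop the physmap (kernel) mapping for this page. */
                set_kpte((void *)addr, page, __pgprot(0));
                SetPageXpfoUnmapped(page);

                start = min(start, addr);
                end = max(end, addr + PAGE_SIZE);
        }

        /* One TLB flush covers the whole batch instead of one per page. */
        if (end)
                flush_tlb_kernel_range(start, end);
}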

I measured system time for parallel make with the unmodified 5.0
kernel, with 5.0 plus the XPFO patches before these optimizations,
and then again after applying each of these patches. Here are the
results:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

5.0					913.862s
5.0+this patch series			1165.259s	1.28x


Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8 GB RAM
make -j4 all

5.0					610.642s
5.0+this patch series			773.075s	1.27x

Performance with this patch set is good enough to use it as a
starting point for further refinement before we merge it into the
mainline kernel, hence the RFC.

I have restructured the patches in this version to separate out the
architecture independent code. I folded much of Julian's improvement
to stop using the page extension mechanism into patch 3.

What remains to be done beyond this patch series:

1. Performance improvements: Ideas to explore - (1) kernel mappings
   private to an mm, (2) Any others??
2. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
   from Juerg. I dropped it for now since swiotlb code for ARM has
   changed a lot since this patch was written. I could use help
   from ARM experts on this.
3. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
   CPUs" to other architectures besides x86.
4. Change kmap to not map the page back into the physmap, and
   instead map it at a new virtual address similar to what kmap_high
   does. Mapping the page back into the physmap re-opens the ret2dir
   window for the duration of the kmap. All of the kmap_high and
   related code can be reused for this, but that will require
   restructuring that code so it can be built for 64-bit as well.
   Any objections to that? (A rough sketch follows this list.)
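
As a rough sketch of the idea in item 4 (not working code;
xpfo_map_alias(), xpfo_unmap_alias() and PageXpfoUser() are
placeholders), kmap of an XPFO-protected page would hand out a
dedicated alias mapping, much like kmap_high() uses the pkmap area
on 32-bit highmem, so the physmap entry never has to be restored:

static void *xpfo_kmap_alias(struct page *page)
{
        /* Kernel-owned pages keep using their physmap address. */
        if (!PageXpfoUser(page))
                return page_address(page);

        /*
         * Map the page at a separate kernel virtual address; the
         * physmap entry stays unmapped, so the ret2dir window does
         * not re-open for the duration of the mapping.
         */
        return xpfo_map_alias(page);
}

static void xpfo_kunmap_alias(struct page *page, void *vaddr)
{
        if (PageXpfoUser(page))
                xpfo_unmap_alias(page, vaddr);
}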

---------------------------------------------------------

Juerg Haefliger (6):
  mm: Add support for eXclusive Page Frame Ownership (XPFO)
  xpfo, x86: Add support for XPFO for x86-64
  lkdtm: Add test for XPFO
  arm64/mm: Add support for XPFO
  swiotlb: Map the buffer if it was unmapped by XPFO
  arm64/mm, xpfo: temporarily map dcache regions

Julian Stecklina (1):
  xpfo, mm: optimize spinlock usage in xpfo_kunmap

Khalid Aziz (2):
  xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  xpfo, mm: Optimize XPFO TLB flushes by batching them together

Tycho Andersen (4):
  mm: add MAP_HUGETLB support to vm_mmap
  x86: always set IF before oopsing from page fault
  mm: add a user_virt_to_phys symbol
  xpfo: add primitives for mapping underlying memory

 .../admin-guide/kernel-parameters.txt         |   6 +
 arch/arm64/Kconfig                            |   1 +
 arch/arm64/mm/Makefile                        |   2 +
 arch/arm64/mm/flush.c                         |   7 +
 arch/arm64/mm/mmu.c                           |   2 +-
 arch/arm64/mm/xpfo.c                          |  66 ++++++
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/pgtable.h                |  26 +++
 arch/x86/include/asm/tlbflush.h               |   1 +
 arch/x86/mm/Makefile                          |   2 +
 arch/x86/mm/fault.c                           |   6 +
 arch/x86/mm/pageattr.c                        |  32 +--
 arch/x86/mm/tlb.c                             |  39 ++++
 arch/x86/mm/xpfo.c                            | 185 +++++++++++++++++
 drivers/misc/lkdtm/Makefile                   |   1 +
 drivers/misc/lkdtm/core.c                     |   3 +
 drivers/misc/lkdtm/lkdtm.h                    |   5 +
 drivers/misc/lkdtm/xpfo.c                     | 196 ++++++++++++++++++
 include/linux/highmem.h                       |  34 +--
 include/linux/mm.h                            |   2 +
 include/linux/mm_types.h                      |   8 +
 include/linux/page-flags.h                    |  23 +-
 include/linux/xpfo.h                          | 191 +++++++++++++++++
 include/trace/events/mmflags.h                |  10 +-
 kernel/dma/swiotlb.c                          |   3 +-
 mm/Makefile                                   |   1 +
 mm/compaction.c                               |   2 +-
 mm/internal.h                                 |   2 +-
 mm/mmap.c                                     |  19 +-
 mm/page_alloc.c                               |  19 +-
 mm/page_isolation.c                           |   2 +-
 mm/util.c                                     |  32 +++
 mm/xpfo.c                                     | 170 +++++++++++++++
 security/Kconfig                              |  27 +++
 34 files changed, 1047 insertions(+), 79 deletions(-)
 create mode 100644 arch/arm64/mm/xpfo.c
 create mode 100644 arch/x86/mm/xpfo.c
 create mode 100644 drivers/misc/lkdtm/xpfo.c
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

Comments

Nadav Amit April 4, 2019, 4:44 p.m. UTC | #1
> On Apr 3, 2019, at 10:34 AM, Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
> This is another update to the work Juerg, Tycho and Julian have
> done on XPFO.

Interesting work, but note that it triggers a warning on my system due to
possible deadlock. It seems that the patch-set disables IRQs in
xpfo_kunmap() and then might flush remote TLBs when a large page is split.
This is wrong, since it might lead to deadlocks.


[  947.262208] WARNING: CPU: 6 PID: 9892 at kernel/smp.c:416 smp_call_function_many+0x92/0x250
[  947.263767] Modules linked in: sb_edac vmw_balloon crct10dif_pclmul crc32_pclmul joydev ghash_clmulni_intel input_leds intel_rapl_perf serio_raw mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core vmw_vsock_vmci_transport vsock vmw_vmci iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear hid_generic usbhid hid vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm aesni_intel psmouse aes_x86_64 crypto_simd cryptd glue_helper mptspi vmxnet3 scsi_transport_spi mptscsih ahci mptbase libahci i2c_piix4 pata_acpi
[  947.274649] CPU: 6 PID: 9892 Comm: cc1 Not tainted 5.0.0+ #7
[  947.275804] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/28/2017
[  947.277704] RIP: 0010:smp_call_function_many+0x92/0x250
[  947.278640] Code: 3b 05 66 fc 4e 01 72 26 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 8b 05 2b cc 7e 01 85 c0 75 bf 80 3d a8 99 4e 01 00 75 b6 <0f> 0b eb b2 44 89 c7 48 c7 c2 a0 9a 61 aa 4c 89 fe 44 89 45 d0 e8
[  947.281895] RSP: 0000:ffffafe04538f970 EFLAGS: 00010046
[  947.282821] RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000000001
[  947.284084] RDX: 0000000000000000 RSI: ffffffffa9078d70 RDI: ffffffffaa619aa0
[  947.285343] RBP: ffffafe04538f9a8 R08: ffff9d7040000ff0 R09: 0000000000000000
[  947.286596] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa9078d70
[  947.287855] R13: 0000000000000000 R14: 0000000000000001 R15: ffffffffaa619aa0
[  947.289118] FS:  00007f668b122ac0(0000) GS:ffff9d727fd80000(0000) knlGS:0000000000000000
[  947.290550] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  947.291569] CR2: 00007f6688389004 CR3: 0000000224496006 CR4: 00000000003606e0
[  947.292861] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  947.294125] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  947.295394] Call Trace:
[  947.295854]  ? load_new_mm_cr3+0xe0/0xe0
[  947.296568]  on_each_cpu+0x2d/0x60
[  947.297191]  flush_tlb_all+0x1c/0x20
[  947.297846]  __split_large_page+0x5d9/0x640
[  947.298604]  set_kpte+0xfe/0x260
[  947.299824]  get_page_from_freelist+0x1633/0x1680
[  947.301260]  ? lookup_address+0x2d/0x30
[  947.302550]  ? set_kpte+0x1e1/0x260
[  947.303760]  __alloc_pages_nodemask+0x13f/0x2e0
[  947.305137]  alloc_pages_vma+0x7a/0x1c0
[  947.306378]  wp_page_copy+0x201/0xa30
[  947.307582]  ? generic_file_read_iter+0x96a/0xcf0
[  947.308946]  do_wp_page+0x1cc/0x420
[  947.310086]  __handle_mm_fault+0xc0d/0x1600
[  947.311331]  handle_mm_fault+0xe1/0x210
[  947.312502]  __do_page_fault+0x23a/0x4c0
[  947.313672]  ? _cond_resched+0x19/0x30
[  947.314795]  do_page_fault+0x2e/0xe0
[  947.315878]  ? page_fault+0x8/0x30
[  947.316916]  page_fault+0x1e/0x30
[  947.317930] RIP: 0033:0x76581e
[  947.318893] Code: eb 05 89 d8 48 8d 04 80 48 8d 34 c5 08 00 00 00 48 85 ff 74 04 44 8b 67 04 e8 de 80 08 00 81 e3 ff ff ff 7f 48 89 45 00 8b 10 <44> 89 60 04 81 e2 00 00 00 80 09 da 89 10 c1 ea 18 83 e2 7f 88 50
[  947.323337] RSP: 002b:00007ffde06c0e40 EFLAGS: 00010202
[  947.324663] RAX: 00007f6688389000 RBX: 0000000000000004 RCX: 0000000000000001
[  947.326317] RDX: 0000000000000000 RSI: 0000000001000001 RDI: 0000000000000017
[  947.327973] RBP: 00007f66883882d8 R08: 00000000032e05f0 R09: 00007f668b30e6f0
[  947.329619] R10: 0000000000000002 R11: 00000000032e05f0 R12: 0000000000000000
[  947.331260] R13: 00007f6688388230 R14: 00007f6688388288 R15: 00007f668ac3b0a8
[  947.332911] ---[ end trace 7d605a38c67d83ae ]---
Khalid Aziz April 4, 2019, 5:18 p.m. UTC | #2
On 4/4/19 10:44 AM, Nadav Amit wrote:
>> On Apr 3, 2019, at 10:34 AM, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> This is another update to the work Juerg, Tycho and Julian have
>> done on XPFO.
> 
> Interesting work, but note that it triggers a warning on my system due to
> possible deadlock. It seems that the patch-set disables IRQs in
> xpfo_kunmap() and then might flush remote TLBs when a large page is split.
> This is wrong, since it might lead to deadlocks.
> 
> 
> [  947.262208] WARNING: CPU: 6 PID: 9892 at kernel/smp.c:416 smp_call_function_many+0x92/0x250
> [...]
> 

Thanks for letting me know. xpfo_kunmap() is not quite right. It will
end up being rewritten for the next version.

--
Khalid
Jon Masters April 6, 2019, 6:40 a.m. UTC | #3
Khalid,

Thanks for these patches. We will test them on x86 and investigate the Arm pieces highlighted.

Jon.