[0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU

Message ID	20230808085056.14644-1-yan.y.zhao@intel.com (mailing list archive)
Headers	show Return-Path: <kvm-owner@vger.kernel.org> From: Yan Zhao <yan.y.zhao@intel.com> To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: pbonzini@redhat.com, seanjc@google.com, Yan Zhao <yan.y.zhao@intel.com> Subject: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU Date: Tue, 8 Aug 2023 16:50:56 +0800 Message-Id: <20230808085056.14644-1-yan.y.zhao@intel.com> Precedence: bulk
Series	KVM: x86/mmu: .change_pte() optimization in TDP MMU \| expand [0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU [1/2] KVM: x86/mmu: Remove dead code in .change_pte() handler in x86 TDP MMU [2/2] KVM: x86/mmu: prefetch SPTE directly in x86 TDP MMU's change_pte() handler

Yan Zhao Aug. 8, 2023, 8:50 a.m. UTC

This series optmizes KVM mmu notifier.change_pte() handler in x86 TDP MMU
(i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and prefetching
notified new PFN into SPTEs directly in the handler.

As in [1], .change_pte() has been dead code on x86 for 10+ years.
Patch 1 drops the dead code in x86 TDP MMU to save cpu cycles and prepare
for optimization in TDP MMU in patch 2.

Patch 2 optimizes TDP MMU's .change_pte() handler to prefetch SPTEs in the
handler directly with PFN info contained in .change_pte() to avoid that
each vCPU write that triggers .change_pte() must undergo twice VMExits and
TDP page faults.

base-commit: fdf0eaf11452 + Sean's patch "KVM: Wrap kvm_{gfn,hva}_range.pte
in a per-action union" [2]

[1]: https://lore.kernel.org/lkml/ZMAO6bhan9l6ybQM@google.com/
[2]:
https://lore.kernel.org/lkml/20230729004144.1054885-1-seanjc@google.com/

Yan Zhao (2):
  KVM: x86/mmu: Remove dead code in .change_pte() handler in x86 TDP MMU
  KVM: x86/mmu: prefetch SPTE directly in x86 TDP MMU's change_pte()
    handler

 arch/x86/kvm/mmu/tdp_mmu.c | 101 +++++++++++++++++++++++++------------
 1 file changed, 68 insertions(+), 33 deletions(-)

Sean Christopherson Aug. 16, 2023, 6:18 p.m. UTC | #1

On Tue, Aug 08, 2023, Yan Zhao wrote:
> This series optmizes KVM mmu notifier.change_pte() handler in x86 TDP MMU
> (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and prefetching
> notified new PFN into SPTEs directly in the handler.
> 
> As in [1], .change_pte() has been dead code on x86 for 10+ years.
> Patch 1 drops the dead code in x86 TDP MMU to save cpu cycles and prepare
> for optimization in TDP MMU in patch 2.

If we're going to officially kill the long-dead attempt at optimizing KSM, I'd
strongly prefer to rip out .change_pte() entirely, i.e. kill it off in all
architectures and remove it from mmu_notifiers.  The only reason I haven't proposed
such patches is because I didn't want to it to backfire and lead to someone trying
to resurrect the optimizations for KSM.

> Patch 2 optimizes TDP MMU's .change_pte() handler to prefetch SPTEs in the
> handler directly with PFN info contained in .change_pte() to avoid that
> each vCPU write that triggers .change_pte() must undergo twice VMExits and
> TDP page faults.

IMO, prefaulting guest memory as writable is better handled by userspace, e.g. by
using QEMU's prealloc option.  It's more coarse grained, but at a minimum it's
sufficient for improving guest boot time, e.g. by preallocating memory below 4GiB.

And we can do even better, e.g. by providing a KVM ioctl() to allow userspace to
prefault memory not just into the primary MMU, but also into KVM's MMU.  Such an
ioctl() is basically manadatory for TDX, we just need to morph the support being
added by TDX into a generic ioctl()[*]

Prefaulting guest memory as writable into the primary MMU should be able to achieve
far better performance than hooking .change_pte(), as it will avoid the mmu_notifier
invalidation, e.g. won't trigger taking mmu_lock for write and the resulting remote
TLB flush(es).  And a KVM ioctl() to prefault into KVM's MMU should eliminate page
fault VM-Exits entirely.

Explicit prefaulting isn't perfect, but IMO the value added by prefetching in
.change_pte() isn't enough to justify carrying the hook and the code in KVM.

[*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com

Yan Zhao Aug. 17, 2023, midnight UTC | #2

On Wed, Aug 16, 2023 at 11:18:03AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Yan Zhao wrote:
> > This series optmizes KVM mmu notifier.change_pte() handler in x86 TDP MMU
> > (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and prefetching
> > notified new PFN into SPTEs directly in the handler.
> > 
> > As in [1], .change_pte() has been dead code on x86 for 10+ years.
> > Patch 1 drops the dead code in x86 TDP MMU to save cpu cycles and prepare
> > for optimization in TDP MMU in patch 2.
> 
> If we're going to officially kill the long-dead attempt at optimizing KSM, I'd
> strongly prefer to rip out .change_pte() entirely, i.e. kill it off in all
> architectures and remove it from mmu_notifiers.  The only reason I haven't proposed
> such patches is because I didn't want to it to backfire and lead to someone trying
> to resurrect the optimizations for KSM.
> 
> > Patch 2 optimizes TDP MMU's .change_pte() handler to prefetch SPTEs in the
> > handler directly with PFN info contained in .change_pte() to avoid that
> > each vCPU write that triggers .change_pte() must undergo twice VMExits and
> > TDP page faults.
> 
> IMO, prefaulting guest memory as writable is better handled by userspace, e.g. by
> using QEMU's prealloc option.  It's more coarse grained, but at a minimum it's
> sufficient for improving guest boot time, e.g. by preallocating memory below 4GiB.
> 
> And we can do even better, e.g. by providing a KVM ioctl() to allow userspace to
> prefault memory not just into the primary MMU, but also into KVM's MMU.  Such an
> ioctl() is basically manadatory for TDX, we just need to morph the support being
> added by TDX into a generic ioctl()[*]
> 
> Prefaulting guest memory as writable into the primary MMU should be able to achieve
> far better performance than hooking .change_pte(), as it will avoid the mmu_notifier
> invalidation, e.g. won't trigger taking mmu_lock for write and the resulting remote
> TLB flush(es).  And a KVM ioctl() to prefault into KVM's MMU should eliminate page
> fault VM-Exits entirely.
> 
> Explicit prefaulting isn't perfect, but IMO the value added by prefetching in
> .change_pte() isn't enough to justify carrying the hook and the code in KVM.
> 
> [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com
Hi Sean,
As I didn't write the full picture of patch 2 in the cover letter well,
may I request you to take a look of patch 2 to see if you like it? (in
case if you just read the cover letter).
What I observed is that each vCPU write to a COW page in primary MMU
will lead to twice TDP page faults.
Then, I just update the secondary MMU during the first TDP page fault
to avoid the second one.
It's not a blind prefetch (I checked the vCPU to ensure it's triggered
by a vCPU operation as much as possible) and it can benefit guests who
doesn't explicitly request a prefault memory as write.


Thanks
Yan

Sean Christopherson Aug. 17, 2023, 5:53 p.m. UTC | #3

On Thu, Aug 17, 2023, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 11:18:03AM -0700, Sean Christopherson wrote:
> > On Tue, Aug 08, 2023, Yan Zhao wrote:
> > > This series optmizes KVM mmu notifier.change_pte() handler in x86 TDP MMU
> > > (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and prefetching
> > > notified new PFN into SPTEs directly in the handler.
> > > 
> > > As in [1], .change_pte() has been dead code on x86 for 10+ years.
> > > Patch 1 drops the dead code in x86 TDP MMU to save cpu cycles and prepare
> > > for optimization in TDP MMU in patch 2.
> > 
> > If we're going to officially kill the long-dead attempt at optimizing KSM, I'd
> > strongly prefer to rip out .change_pte() entirely, i.e. kill it off in all
> > architectures and remove it from mmu_notifiers.  The only reason I haven't proposed
> > such patches is because I didn't want to it to backfire and lead to someone trying
> > to resurrect the optimizations for KSM.
> > 
> > > Patch 2 optimizes TDP MMU's .change_pte() handler to prefetch SPTEs in the
> > > handler directly with PFN info contained in .change_pte() to avoid that
> > > each vCPU write that triggers .change_pte() must undergo twice VMExits and
> > > TDP page faults.
> > 
> > IMO, prefaulting guest memory as writable is better handled by userspace, e.g. by
> > using QEMU's prealloc option.  It's more coarse grained, but at a minimum it's
> > sufficient for improving guest boot time, e.g. by preallocating memory below 4GiB.
> > 
> > And we can do even better, e.g. by providing a KVM ioctl() to allow userspace to
> > prefault memory not just into the primary MMU, but also into KVM's MMU.  Such an
> > ioctl() is basically manadatory for TDX, we just need to morph the support being
> > added by TDX into a generic ioctl()[*]
> > 
> > Prefaulting guest memory as writable into the primary MMU should be able to achieve
> > far better performance than hooking .change_pte(), as it will avoid the mmu_notifier
> > invalidation, e.g. won't trigger taking mmu_lock for write and the resulting remote
> > TLB flush(es).  And a KVM ioctl() to prefault into KVM's MMU should eliminate page
> > fault VM-Exits entirely.
> > 
> > Explicit prefaulting isn't perfect, but IMO the value added by prefetching in
> > .change_pte() isn't enough to justify carrying the hook and the code in KVM.
> > 
> > [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com
> Hi Sean,
> As I didn't write the full picture of patch 2 in the cover letter well,
> may I request you to take a look of patch 2 to see if you like it? (in
> case if you just read the cover letter).

I read patch two, I replied to the cover letter as I wanted to discuss the two
patches together since implementing the CoW optimization effectively means
dropping the long-dead KSM optimization.

> What I observed is that each vCPU write to a COW page in primary MMU
> will lead to twice TDP page faults.
> Then, I just update the secondary MMU during the first TDP page fault
> to avoid the second one.
> It's not a blind prefetch (I checked the vCPU to ensure it's triggered
> by a vCPU operation as much as possible)

Yes, that's part of the complexity I don't like.

> and it can benefit guests who doesn't explicitly request a prefault memory as
> write.

Yes, I'm arguing that the benefit isn't significant, and that the use cases it
might benefit aren't things people care about optimizing.

I'm very skeptical that shaving those 8000 VM-Exits will translate to a meaningful
reduction in guest boot time, let alone scale beyond very specific scenarios and
configurations, which again, are likely suboptimal in nature.  Actually, they most
definitely are suboptimal, because the fact that this provides any benefit
whatsoever means that either your VM isn't being backed with hugepages, or it's
being backed with THP and transparent_hugepage/use_zero_page is enabled (and thus
is generating CoW behavior).

Enabling THP or using HugeTLB (which again can be done on a subset of guest memory)
will have a far, far bigger impact on guest performance.  Ditto for disabling
using the huge zero_page when backing VMs with THP (any page touched by the guest
is all but guaranteed to be written sooner than later, so using the zero_page
doesn't make a whole lot of sense).

E.g. a single CoW operation will take mmu_lock for write three times:
invalidate_range_start(), change_pte(), and invalidate_range_end(), not to mention
the THP zero_page CoW will first fault-in a read-only mapping, then split that
mapping, and then do CoW on the 4KiB PTEs, which is *really* suboptimal.

Actually, I don't even completely understand how you're seeing CoW behavior in
the first place.  No sane guest should blindly read (or execute) uninitialized
memory.  IIUC, you're not running a Windows guest, and even if you are, AFAIK
QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
been zeroed by the hypervisor.  If KSM is to blame, then my answer it to turn off
KSM, because turning on KSM is antithetical to guest performance (not to mention
that KSM is wildly insecure for the guest, especially given the number of speculative
execution attacks these days).

If there's something else going on, i.e. if your VM really is somehow generating
reads before writes, and if we really want to optimize use cases that can't use
hugepages for whatever reason, I would much prefer to do something like add a
memslot flag to state that the memslot should *always* be mapped writable.  Because
outside of setups that use KSM, the only reason I can think of to not map memory
writable straightaway is if userspace somehow knows the guest isn't going to write
that memory.

If it weren't for KSM, and if it wouldn't potentially be a breaking change, I
would even go so far as to say that KVM should always map writable memslots as
writable in the guest.

E.g. minus the uAPI, this is a lot simpler to implement and maintain.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfbaafbe3a00..6c4640483881 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2727,10 +2727,14 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
                return KVM_PFN_NOSLOT;
        }

-       /* Do not map writable pfn in the readonly memslot. */
-       if (writable && memslot_is_readonly(slot)) {
-               *writable = false;
-               writable = NULL;
+       if (writable) {
+               if (memslot_is_readonly(slot)) {
+                       *writable = false;
+                       writable = NULL;
+               } else if (memslot_is_always_writable(slot)) {
+                       *writable = true;
+                       write_fault = true;
+               }
        }

        return hva_to_pfn(addr, atomic, interruptible, async, write_fault,

And FWIW, removing .change_pte() entirely, even without any other optimizations,
will also benefit those guests, as it will remove a source of mmu_lock contention
along with all of the overhead of invoking callbacks, walking memslots, etc.  And
removing .change_pte() will benefit *all* guests by eliminating unrelated callbacks,
i.e. callbacks when memory for the VMM takes a CoW fault.

So yeah, unless I'm misunderstanding the bigger picture, the more I look at this,
the more I'm against it.

Yan Zhao Aug. 18, 2023, 10:17 a.m. UTC | #4

On Thu, Aug 17, 2023 at 10:53:25AM -0700, Sean Christopherson wrote:
...

> > What I observed is that each vCPU write to a COW page in primary MMU
> > will lead to twice TDP page faults.
> > Then, I just update the secondary MMU during the first TDP page fault
> > to avoid the second one.
> > It's not a blind prefetch (I checked the vCPU to ensure it's triggered
> > by a vCPU operation as much as possible)
> 
> Yes, that's part of the complexity I don't like.
> 
> > and it can benefit guests who doesn't explicitly request a prefault memory as
> > write.
> 
> Yes, I'm arguing that the benefit isn't significant, and that the use cases it
> might benefit aren't things people care about optimizing.
> 
> I'm very skeptical that shaving those 8000 VM-Exits will translate to a meaningful
> reduction in guest boot time, let alone scale beyond very specific scenarios and
> configurations, which again, are likely suboptimal in nature.  Actually, they most
> definitely are suboptimal, because the fact that this provides any benefit
> whatsoever means that either your VM isn't being backed with hugepages, or it's
> being backed with THP and transparent_hugepage/use_zero_page is enabled (and thus
> is generating CoW behavior).
Yes, it's being backed with THP and transparent_hugepage/use_zero_page is enabled.

> 
> Enabling THP or using HugeTLB (which again can be done on a subset of guest memory)
> will have a far, far bigger impact on guest performance.  Ditto for disabling
> using the huge zero_page when backing VMs with THP (any page touched by the guest
> is all but guaranteed to be written sooner than later, so using the zero_page
> doesn't make a whole lot of sense).
> 
> E.g. a single CoW operation will take mmu_lock for write three times:
> invalidate_range_start(), change_pte(), and invalidate_range_end(), not to mention
> the THP zero_page CoW will first fault-in a read-only mapping, then split that
> mapping, and then do CoW on the 4KiB PTEs, which is *really* suboptimal.
> 
> Actually, I don't even completely understand how you're seeing CoW behavior in
> the first place.  No sane guest should blindly read (or execute) uninitialized
> memory.  IIUC, you're not running a Windows guest, and even if you are, AFAIK
> QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
> been zeroed by the hypervisor.  If KSM is to blame, then my answer it to turn off
> KSM, because turning on KSM is antithetical to guest performance (not to mention
> that KSM is wildly insecure for the guest, especially given the number of speculative
> execution attacks these days).
I'm running a linux guest.
KSM is not turned on both in guest and host.
Both guest and host have turned on transparent huge page.

The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
which will cause
    (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
        mapped.
    (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
        which will fail

The guest then writes the GFN.
This step will trigger (huge pmd split for huge page case) and .change_pte().

My guest is surely a sane guest. But currently I can't find out why
certain pages are read before write.
Will return back to you the reason after figuring it out after my long vacation.

> 
> If there's something else going on, i.e. if your VM really is somehow generating
> reads before writes, and if we really want to optimize use cases that can't use
> hugepages for whatever reason, I would much prefer to do something like add a
> memslot flag to state that the memslot should *always* be mapped writable.  Because
Will check if this flag is necessary after figuring out the reason.

> outside of setups that use KSM, the only reason I can think of to not map memory
> writable straightaway is if userspace somehow knows the guest isn't going to write
> that memory.
> 
> If it weren't for KSM, and if it wouldn't potentially be a breaking change, I
> would even go so far as to say that KVM should always map writable memslots as
> writable in the guest.
> 
> E.g. minus the uAPI, this is a lot simpler to implement and maintain.
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dfbaafbe3a00..6c4640483881 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2727,10 +2727,14 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>                 return KVM_PFN_NOSLOT;
>         }
>  
> -       /* Do not map writable pfn in the readonly memslot. */
> -       if (writable && memslot_is_readonly(slot)) {
> -               *writable = false;
> -               writable = NULL;
> +       if (writable) {
> +               if (memslot_is_readonly(slot)) {
> +                       *writable = false;
> +                       writable = NULL;
> +               } else if (memslot_is_always_writable(slot)) {
> +                       *writable = true;
> +                       write_fault = true;
> +               }
>         }
>  
>         return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> 
>

> And FWIW, removing .change_pte() entirely, even without any other optimizations,
> will also benefit those guests, as it will remove a source of mmu_lock contention
> along with all of the overhead of invoking callbacks, walking memslots, etc.  And
> removing .change_pte() will benefit *all* guests by eliminating unrelated callbacks,
> i.e. callbacks when memory for the VMM takes a CoW fault.
>
If with above "always write_fault = true" solution, I think it's better.

> So yeah, unless I'm misunderstanding the bigger picture, the more I look at this,
> the more I'm against it.

Sean Christopherson Aug. 18, 2023, 1:46 p.m. UTC | #5

On Fri, Aug 18, 2023, Yan Zhao wrote:
> On Thu, Aug 17, 2023 at 10:53:25AM -0700, Sean Christopherson wrote:
> > And FWIW, removing .change_pte() entirely, even without any other optimizations,
> > will also benefit those guests, as it will remove a source of mmu_lock contention
> > along with all of the overhead of invoking callbacks, walking memslots, etc.  And
> > removing .change_pte() will benefit *all* guests by eliminating unrelated callbacks,
> > i.e. callbacks when memory for the VMM takes a CoW fault.
> >
> If with above "always write_fault = true" solution, I think it's better.

Another option would be to allow a per-mm override of use_zero_page, but I think
I like the KVM memslot route more as it provides better granularity, doesn't
prevent CoW for VMM memory, and works even if THP isn't being used.

Yan Zhao Sept. 4, 2023, 7:03 a.m. UTC | #6

...
> > Actually, I don't even completely understand how you're seeing CoW behavior in
> > the first place.  No sane guest should blindly read (or execute) uninitialized
> > memory.  IIUC, you're not running a Windows guest, and even if you are, AFAIK
> > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
> > been zeroed by the hypervisor.  If KSM is to blame, then my answer it to turn off
> > KSM, because turning on KSM is antithetical to guest performance (not to mention
> > that KSM is wildly insecure for the guest, especially given the number of speculative
> > execution attacks these days).
> I'm running a linux guest.
> KSM is not turned on both in guest and host.
> Both guest and host have turned on transparent huge page.
> 
> The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
> which will cause
>     (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
>         mapped.
>     (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
>         which will fail
> 
> The guest then writes the GFN.
> This step will trigger (huge pmd split for huge page case) and .change_pte().
> 
> My guest is surely a sane guest. But currently I can't find out why
> certain pages are read before write.
> Will return back to you the reason after figuring it out after my long vacation.
Finally I figured out the reason.

Except 4 pages were read before written from vBIOS (I just want to skip finding
out why vBIOS does this), the remaining thousands of pages were read before
written from the guest Linux kernel.

If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or
"CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or
"init_on_free=1", this read before written problem goes away.

However, turning on those configs has side effects as said in kernel config
message:
"all page allocator and slab allocator memory will be zeroed when allocated,
eliminating many kinds of "uninitialized heap memory" flaws, especially
heap content exposures. The performance impact varies by workload, but most
cases see <1% impact. Some synthetic workloads have measured as high as 7%."

If without the above two configs, or if with init_on_alloc=0 && init_on_free=0,
the root cause for all the reads of uninitialized heap memory are related to
page cache pages of the guest virtual devices (specifically the virtual IDE
device in my case).

When the guest kernel triggers a guest block device read ahead, pages are
allocated as page cache pages, and requests to read disk data into the page
cache are issued.

The disk data read requests will cause dma_direct_map_page() called if vIOMMU
is not enabled. Then, because the virtual IDE device can only direct access
32-bit DMA address (equal to GPA) at maximum, swiotlb will be used as DMA
bounce if page cache pages are with GPA > 32 bits.

Then the call path is
dma_direct_map_page() --> swiotlb_map() -->swiotlb_tbl_map_single()

In swiotlb_tbl_map_single(), though DMA direction is DMA_FROM_DEVICE,
this swiotlb_tbl_map_single() will call
swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE) to read page cache
content to the bounce buffer first.
Then, during device DMAs, device content is DMAed into the bounce buffer.
After that, the bounce buffer data will be copied back to the page cache page
after calling dma_direct_unmap_page() --> swiotlb_tbl_unmap_single().


IOW, before reading ahead device data into the page cache, the page cache is
read and copied to the bounce buffer, though an immediate write is followed to
copy bounce buffer data back to the page cache.

This explains why it's observed in host that most pages are written immediately
after it's read, and .change_pte() occurs heavily during guest boot-up and
nested guest boot-up, -- when disk readahead happens abundantly.

The reason for this unconditional read of page into bounce buffer
(caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)")
is explained in the code:

/*
 * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
 * to the tlb buffer, if we knew for sure the device will
 * overwrite the entire current content. But we don't. Thus
 * unconditional bounce may prevent leaking swiotlb content (i.e.
 * kernel memory) to user-space.
 */

If we neglect this risk and do changes like
-       swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
+       if (dir != DMA_FROM_DEVICE)
+               swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);

the issue of pages read before written from guest kernel just went away.

I don't think it's a swiotlb bug, because to prevent leaking swiotlb
content, if target page content is not copied firstly to the swiotlb's
bounce buffer, then the bounce buffer needs to be initialized to 0.
However, swiotlb_tbl_map_single() does not know whether the target page
is initialized or not. Then, it would cause page content to be trimmed
if device does not overwrite the entire memory.

> 
> > 
> > If there's something else going on, i.e. if your VM really is somehow generating
> > reads before writes, and if we really want to optimize use cases that can't use
> > hugepages for whatever reason, I would much prefer to do something like add a
> > memslot flag to state that the memslot should *always* be mapped writable.  Because
> Will check if this flag is necessary after figuring out the reason.
As explained above, I think it's a valid and non-rare practice in guest kernel to
cause read of uninitialized heap memory. And the host admin may not know
exactly when it's appropriate to apply the memslot flag.

Do you think it's good to make the "always write_fault = true" solution enabled by default?

Sean Christopherson Sept. 5, 2023, 6:59 p.m. UTC | #7

+swiotbl maintainers and Linus

Spinning off a discussion about swiotlb behavior to its own thread.

Quick background: when running Linux as a KVM guest, Yan observed memory accesses
where Linux is reading completely uninitialized memory (never been written by the
guest) and traced it back to this code in the swiotlb:

	/*
	 * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
	 * to the tlb buffer, if we knew for sure the device will
	 * overwrite the entire current content. But we don't. Thus
	 * unconditional bounce may prevent leaking swiotlb content (i.e.
	 * kernel memory) to user-space.
	 */
	swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);

The read-before-write behavior results in suboptimal performance as KVM maps the
memory as read-only, and then triggers CoW when the guest inevitably writes the
memory (CoW is significantly more expensive when KVM is involved).

There are a variety of ways to workaround this in KVM, but even if we decide to
address this in KVM, the swiotlb behavior is sketchy.  Not to mention that any
KVM changes are highly unlikely to be backported to LTS kernels.

On Mon, Sep 04, 2023, Yan Zhao wrote:
> ...
> > > Actually, I don't even completely understand how you're seeing CoW behavior in
> > > the first place.  No sane guest should blindly read (or execute) uninitialized
> > > memory.  IIUC, you're not running a Windows guest, and even if you are, AFAIK
> > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
> > > been zeroed by the hypervisor.  If KSM is to blame, then my answer it to turn off
> > > KSM, because turning on KSM is antithetical to guest performance (not to mention
> > > that KSM is wildly insecure for the guest, especially given the number of speculative
> > > execution attacks these days).
> > I'm running a linux guest.
> > KSM is not turned on both in guest and host.
> > Both guest and host have turned on transparent huge page.
> > 
> > The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
> > which will cause
> >     (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
> >         mapped.
> >     (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
> >         which will fail
> > 
> > The guest then writes the GFN.
> > This step will trigger (huge pmd split for huge page case) and .change_pte().
> > 
> > My guest is surely a sane guest. But currently I can't find out why
> > certain pages are read before write.
> > Will return back to you the reason after figuring it out after my long vacation.
> Finally I figured out the reason.
> 
> Except 4 pages were read before written from vBIOS (I just want to skip finding
> out why vBIOS does this), the remaining thousands of pages were read before
> written from the guest Linux kernel.

...

> When the guest kernel triggers a guest block device read ahead, pages are
> allocated as page cache pages, and requests to read disk data into the page
> cache are issued.
> 
> The disk data read requests will cause dma_direct_map_page() called if vIOMMU
> is not enabled. Then, because the virtual IDE device can only direct access
> 32-bit DMA address (equal to GPA) at maximum, swiotlb will be used as DMA
> bounce if page cache pages are with GPA > 32 bits.
>
> Then the call path is
> dma_direct_map_page() --> swiotlb_map() -->swiotlb_tbl_map_single()
> 
> In swiotlb_tbl_map_single(), though DMA direction is DMA_FROM_DEVICE,
> this swiotlb_tbl_map_single() will call
> swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE) to read page cache
> content to the bounce buffer first.
> Then, during device DMAs, device content is DMAed into the bounce buffer.
> After that, the bounce buffer data will be copied back to the page cache page
> after calling dma_direct_unmap_page() --> swiotlb_tbl_unmap_single().
> 
> 
> IOW, before reading ahead device data into the page cache, the page cache is
> read and copied to the bounce buffer, though an immediate write is followed to
> copy bounce buffer data back to the page cache.
> 
> This explains why it's observed in host that most pages are written immediately
> after it's read, and .change_pte() occurs heavily during guest boot-up and
> nested guest boot-up, -- when disk readahead happens abundantly.
> 
> The reason for this unconditional read of page into bounce buffer
> (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)")
> is explained in the code:
> 
> /*
>  * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
>  * to the tlb buffer, if we knew for sure the device will
>  * overwrite the entire current content. But we don't. Thus
>  * unconditional bounce may prevent leaking swiotlb content (i.e.
>  * kernel memory) to user-space.
>  */
> 
> If we neglect this risk and do changes like
> -       swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> +       if (dir != DMA_FROM_DEVICE)
> +               swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> 
> the issue of pages read before written from guest kernel just went away.
> 
> I don't think it's a swiotlb bug, because to prevent leaking swiotlb
> content, if target page content is not copied firstly to the swiotlb's
> bounce buffer, then the bounce buffer needs to be initialized to 0.
> However, swiotlb_tbl_map_single() does not know whether the target page
> is initialized or not. Then, it would cause page content to be trimmed
> if device does not overwrite the entire memory.

The math doesn't add up though.  Observing a read-before-write means the intended
goal of preventing data leaks to userspace is not being met.  I.e. instead of
leaking old swiotlb, the kernel is (theoretically) leaking whatever data is in the
original page (page cache in your case).

That data *may* be completely uninitialized, especially during boot, but the
original pages may also contain data from whatever was using the pages before they
were allocated by the driver.

It's possible the read of uninitialized data is observed only when either the
driver that triggered the mapping _knows_ that the device will overwrite the entire
mapping, or the driver will consume only the written parts.  But in those cases,
copying from the original memory is completely pointless.

If neither of the above holds true, then copying from the original adds value only
if preserving the data is necessary for functional correctness, or the driver
explicitly initialized the original memory.  Given that preserving source data was
recently added, I highly doubt it's necessary for functional correctness.

And if the driver *doesn't* initialize the data, then the copy is at best pointless,
and possibly even worse than leaking stale swiotlb data.

Looking at commit ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE"),
IIUC the data leak was observed with a synthetic test "driver" that was developed
to verify a real data leak fixed by commit a45b599ad808 ("scsi: sg: allocate with
__GFP_ZERO in sg_build_indirect()").  Which basically proves my point: copying
from the source only adds value absent a bug in the owning driver.

IMO, rather than copying from the original memory, swiotlb_tbl_map_single() should
simply zero the original page(s) when establishing the mapping.  That would harden
all usage of swiotlb and avoid the read-before-write behavior that is problematic
for KVM.

Linus Torvalds Sept. 5, 2023, 7:30 p.m. UTC | #8

On Tue, 5 Sept 2023 at 11:59, Sean Christopherson <seanjc@google.com> wrote:
>
> IMO, rather than copying from the original memory, swiotlb_tbl_map_single() should
> simply zero the original page(s) when establishing the mapping.  That would harden
> all usage of swiotlb and avoid the read-before-write behavior that is problematic
> for KVM.

I don't disagree, but the argument at the time (I think from
Christoph, but I might be barking entirely up the wrong tree) was that
the swiotlb behavior should match hardware DMA, and we had known cases
where the hardware only did partial IO.

Honestly, even pre-zeroing sounds silly (and *much* too expensive,
even if less so than copying), and we'd be better off if we only zero
the rest once we have seen a partial DMA result, but I don't actually
know if we have that partial size knowledge.

            Linus

Sean Christopherson Sept. 5, 2023, 8:18 p.m. UTC | #9

On Mon, Sep 04, 2023, Yan Zhao wrote:
> ...
> > > Actually, I don't even completely understand how you're seeing CoW behavior in
> > > the first place.  No sane guest should blindly read (or execute) uninitialized
> > > memory.  IIUC, you're not running a Windows guest, and even if you are, AFAIK
> > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
> > > been zeroed by the hypervisor.  If KSM is to blame, then my answer it to turn off
> > > KSM, because turning on KSM is antithetical to guest performance (not to mention
> > > that KSM is wildly insecure for the guest, especially given the number of speculative
> > > execution attacks these days).
> > I'm running a linux guest.
> > KSM is not turned on both in guest and host.
> > Both guest and host have turned on transparent huge page.
> > 
> > The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
> > which will cause
> >     (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
> >         mapped.
> >     (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
> >         which will fail
> > 
> > The guest then writes the GFN.
> > This step will trigger (huge pmd split for huge page case) and .change_pte().
> > 
> > My guest is surely a sane guest. But currently I can't find out why
> > certain pages are read before write.
> > Will return back to you the reason after figuring it out after my long vacation.
> Finally I figured out the reason.
> 
> Except 4 pages were read before written from vBIOS (I just want to skip finding
> out why vBIOS does this), the remaining thousands of pages were read before
> written from the guest Linux kernel.
> 
> If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or
> "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or
> "init_on_free=1", this read before written problem goes away.
> 
> However, turning on those configs has side effects as said in kernel config
> message:
> "all page allocator and slab allocator memory will be zeroed when allocated,
> eliminating many kinds of "uninitialized heap memory" flaws, especially
> heap content exposures. The performance impact varies by workload, but most
> cases see <1% impact. Some synthetic workloads have measured as high as 7%."
> 
> If without the above two configs, or if with init_on_alloc=0 && init_on_free=0,
> the root cause for all the reads of uninitialized heap memory are related to

Yeah, forcing the guest to pre-initialize all memory is a hack-a-fix and not a
real solution.

> page cache pages of the guest virtual devices (specifically the virtual IDE
> device in my case).

Why are you using IDE?  IDE is comically slow compared to VirtIO, and VirtIO has
been broadly supported for something like 15 years, even on Windows.

> The reason for this unconditional read of page into bounce buffer
> (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)")
> is explained in the code:
> 
> /*
>  * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
>  * to the tlb buffer, if we knew for sure the device will
>  * overwrite the entire current content. But we don't. Thus
>  * unconditional bounce may prevent leaking swiotlb content (i.e.
>  * kernel memory) to user-space.
>  */
> 
> If we neglect this risk and do changes like
> -       swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> +       if (dir != DMA_FROM_DEVICE)
> +               swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> 
> the issue of pages read before written from guest kernel just went away.
> 
> I don't think it's a swiotlb bug, because to prevent leaking swiotlb
> content, if target page content is not copied firstly to the swiotlb's
> bounce buffer, then the bounce buffer needs to be initialized to 0.
> However, swiotlb_tbl_map_single() does not know whether the target page
> is initialized or not. Then, it would cause page content to be trimmed
> if device does not overwrite the entire memory.
> 
> > 
> > > 
> > > If there's something else going on, i.e. if your VM really is somehow generating
> > > reads before writes, and if we really want to optimize use cases that can't use
> > > hugepages for whatever reason, I would much prefer to do something like add a
> > > memslot flag to state that the memslot should *always* be mapped writable.  Because
> > Will check if this flag is necessary after figuring out the reason.
> As explained above, I think it's a valid and non-rare practice in guest kernel to
> cause read of uninitialized heap memory.

Heh, for some definitions of valid.  

> And the host admin may not know exactly when it's appropriate to apply the
> memslot flag.

Yeah, a memslot flag is too fine-grained.

> Do you think it's good to make the "always write_fault = true" solution enabled
> by default?

Sadly, probably not, because that would regress setups that do want to utilize
CoW, e.g. I'm pretty sure requesting everything to be writable would be a big
negative for KSM.

I do think we should add a KVM knob though.  Regardless of the validity or frequency
of the guest behavior, and even though userspace can also workaround this by
preallocating guest memory, I am struggling to think of any reason outside of KSM
where CoW semantics are desirable.

Ooh, actually, maybe we could do

	static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM);

and then cross our fingers that that doesn't regress some other funky setups.

Robin Murphy Sept. 6, 2023, 12:29 a.m. UTC | #10

On 2023-09-05 19:59, Sean Christopherson wrote:
> +swiotbl maintainers and Linus
> 
> Spinning off a discussion about swiotlb behavior to its own thread.
> 
> Quick background: when running Linux as a KVM guest, Yan observed memory accesses
> where Linux is reading completely uninitialized memory (never been written by the
> guest) and traced it back to this code in the swiotlb:
> 
> 	/*
> 	 * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
> 	 * to the tlb buffer, if we knew for sure the device will
> 	 * overwrite the entire current content. But we don't. Thus
> 	 * unconditional bounce may prevent leaking swiotlb content (i.e.
> 	 * kernel memory) to user-space.
> 	 */
> 	swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> 
> The read-before-write behavior results in suboptimal performance as KVM maps the
> memory as read-only, and then triggers CoW when the guest inevitably writes the
> memory (CoW is significantly more expensive when KVM is involved).
> 
> There are a variety of ways to workaround this in KVM, but even if we decide to
> address this in KVM, the swiotlb behavior is sketchy.  Not to mention that any
> KVM changes are highly unlikely to be backported to LTS kernels.
> 
> On Mon, Sep 04, 2023, Yan Zhao wrote:
>> ...
>>>> Actually, I don't even completely understand how you're seeing CoW behavior in
>>>> the first place.  No sane guest should blindly read (or execute) uninitialized
>>>> memory.  IIUC, you're not running a Windows guest, and even if you are, AFAIK
>>>> QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
>>>> been zeroed by the hypervisor.  If KSM is to blame, then my answer it to turn off
>>>> KSM, because turning on KSM is antithetical to guest performance (not to mention
>>>> that KSM is wildly insecure for the guest, especially given the number of speculative
>>>> execution attacks these days).
>>> I'm running a linux guest.
>>> KSM is not turned on both in guest and host.
>>> Both guest and host have turned on transparent huge page.
>>>
>>> The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
>>> which will cause
>>>      (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
>>>          mapped.
>>>      (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
>>>          which will fail
>>>
>>> The guest then writes the GFN.
>>> This step will trigger (huge pmd split for huge page case) and .change_pte().
>>>
>>> My guest is surely a sane guest. But currently I can't find out why
>>> certain pages are read before write.
>>> Will return back to you the reason after figuring it out after my long vacation.
>> Finally I figured out the reason.
>>
>> Except 4 pages were read before written from vBIOS (I just want to skip finding
>> out why vBIOS does this), the remaining thousands of pages were read before
>> written from the guest Linux kernel.
> 
> ...
> 
>> When the guest kernel triggers a guest block device read ahead, pages are
>> allocated as page cache pages, and requests to read disk data into the page
>> cache are issued.
>>
>> The disk data read requests will cause dma_direct_map_page() called if vIOMMU
>> is not enabled. Then, because the virtual IDE device can only direct access
>> 32-bit DMA address (equal to GPA) at maximum, swiotlb will be used as DMA
>> bounce if page cache pages are with GPA > 32 bits.
>>
>> Then the call path is
>> dma_direct_map_page() --> swiotlb_map() -->swiotlb_tbl_map_single()
>>
>> In swiotlb_tbl_map_single(), though DMA direction is DMA_FROM_DEVICE,
>> this swiotlb_tbl_map_single() will call
>> swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE) to read page cache
>> content to the bounce buffer first.
>> Then, during device DMAs, device content is DMAed into the bounce buffer.
>> After that, the bounce buffer data will be copied back to the page cache page
>> after calling dma_direct_unmap_page() --> swiotlb_tbl_unmap_single().
>>
>>
>> IOW, before reading ahead device data into the page cache, the page cache is
>> read and copied to the bounce buffer, though an immediate write is followed to
>> copy bounce buffer data back to the page cache.
>>
>> This explains why it's observed in host that most pages are written immediately
>> after it's read, and .change_pte() occurs heavily during guest boot-up and
>> nested guest boot-up, -- when disk readahead happens abundantly.
>>
>> The reason for this unconditional read of page into bounce buffer
>> (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)")
>> is explained in the code:
>>
>> /*
>>   * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
>>   * to the tlb buffer, if we knew for sure the device will
>>   * overwrite the entire current content. But we don't. Thus
>>   * unconditional bounce may prevent leaking swiotlb content (i.e.
>>   * kernel memory) to user-space.
>>   */
>>
>> If we neglect this risk and do changes like
>> -       swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
>> +       if (dir != DMA_FROM_DEVICE)
>> +               swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
>>
>> the issue of pages read before written from guest kernel just went away.
>>
>> I don't think it's a swiotlb bug, because to prevent leaking swiotlb
>> content, if target page content is not copied firstly to the swiotlb's
>> bounce buffer, then the bounce buffer needs to be initialized to 0.
>> However, swiotlb_tbl_map_single() does not know whether the target page
>> is initialized or not. Then, it would cause page content to be trimmed
>> if device does not overwrite the entire memory.
> 
> The math doesn't add up though.  Observing a read-before-write means the intended
> goal of preventing data leaks to userspace is not being met.  I.e. instead of
> leaking old swiotlb, the kernel is (theoretically) leaking whatever data is in the
> original page (page cache in your case).

Sure, it copies the destination page into the SWIOTLB buffer on map, 
then copies it back out again on unmap, so it's only "leaking" the 
previous contents of the page into the same page. Think of it as SWIOTLB 
doing a read-modify-write of the DMA-mapped region, because that's 
exactly what it's doing (on the basis that it can't be sure of 
overwriting it entirely).

You can then consider it the same as if the device DMAs to the page 
directly without SWIOTLB being involved - if it does a partial write and 
that lets previous contents be mapped to userspace which shouldn't have 
been, that can only be the fault of whoever failed to sanitise the page 
beforehand.

> That data *may* be completely uninitialized, especially during boot, but the
> original pages may also contain data from whatever was using the pages before they
> were allocated by the driver.
> 
> It's possible the read of uninitialized data is observed only when either the
> driver that triggered the mapping _knows_ that the device will overwrite the entire
> mapping, or the driver will consume only the written parts.  But in those cases,
> copying from the original memory is completely pointless.

Indeed, but sadly it is impossible for SWIOTLB to detect when that is or 
isn't the case. No matter what the SWIOTLB buffer is initialised with, 
there is always the possibility that the device coincidentally writes 
the same pattern of bytes, thus it cannot ever know for sure what was or 
wasn't touched between map and unmap.

This is why in the original discussion we also tossed around the idea of 
a DMA attribute for the caller to indicate "I definitely expect the 
device to overwrite this entire mapping", but then we went round in 
circles a bit more, concluded that it might be hard to get right, and 
shelved it, but then the wrong version of the patch got merged which did 
still include the half-formed attribute idea, and then there was the 
whole other trainwreck of reverting and unreverting, and I have a 
feeling that what we ended up with still wasn't quite right but by that 
point I was so fed up of the whole business I had stopped caring.

> If neither of the above holds true, then copying from the original adds value only
> if preserving the data is necessary for functional correctness, or the driver
> explicitly initialized the original memory.  Given that preserving source data was
> recently added, I highly doubt it's necessary for functional correctness.

Seems kinda hard to say that with certainty - I would imagine that the 
amount of actual testing done with "swiotlb=force" is minimal at best. 
There are so many 64-bit-capable devices in the world which would never 
normally interact with SWIOTLB, but are liable to find themselves forced 
into doing so in future confidential compute etc. scenarios. I don't 
think I'd bet anything of value that *nothing* anywhere is depending on 
partial DMA writes being handled correctly.

> And if the driver *doesn't* initialize the data, then the copy is at best pointless,
> and possibly even worse than leaking stale swiotlb data.

Other than the overhead, done right it can't be any worse than if 
SWIOTLB were not involved at all.

> Looking at commit ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE"),
> IIUC the data leak was observed with a synthetic test "driver" that was developed
> to verify a real data leak fixed by commit a45b599ad808 ("scsi: sg: allocate with
> __GFP_ZERO in sg_build_indirect()").  Which basically proves my point: copying
> from the source only adds value absent a bug in the owning driver.

Huh? IIUC the bug there was that the SCSI layer failed to sanitise 
partially-written buffers. That bug was fixed, and the scrutiny therein 
happened to reveal that SWIOTLB *also* had a lower-level problem with 
partial writes, in that it was corrupting DMA-mapped memory which was 
not updated by the device. Partial DMA writes are not in themselves 
indicative of a bug, they may well be a valid and expected behaviour.

> IMO, rather than copying from the original memory, swiotlb_tbl_map_single() should
> simply zero the original page(s) when establishing the mapping.  That would harden
> all usage of swiotlb and avoid the read-before-write behavior that is problematic
> for KVM.

Depends on one's exact definition of "harden"... Corrupting memory with 
zeros is less bad than corrupting memory with someone else's data if you 
look at it from an information security point of view, but from a 
not-corrupting-memory point of view it's definitely still corrupting 
memory :/

Taking a step back, is there not an argument that if people care about 
general KVM performance then they should maybe stop emulating obsolete 
PC hardware from 30 years ago, and at least emulate obsolete PC hardware 
from 20 years ago that supports 64-bit DMA? Even non-virtualised, 
SWIOTLB is pretty horrible for I/O performance by its very nature - 
avoiding it if at all possible should always be preferred.

Thanks,
Robin.

Yan Zhao Sept. 6, 2023, 1:51 a.m. UTC | #11

On Tue, Sep 05, 2023 at 01:18:23PM -0700, Sean Christopherson wrote:
> On Mon, Sep 04, 2023, Yan Zhao wrote:
> > ...
> > > > Actually, I don't even completely understand how you're seeing CoW behavior in
> > > > the first place.  No sane guest should blindly read (or execute) uninitialized
> > > > memory.  IIUC, you're not running a Windows guest, and even if you are, AFAIK
> > > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
> > > > been zeroed by the hypervisor.  If KSM is to blame, then my answer it to turn off
> > > > KSM, because turning on KSM is antithetical to guest performance (not to mention
> > > > that KSM is wildly insecure for the guest, especially given the number of speculative
> > > > execution attacks these days).
> > > I'm running a linux guest.
> > > KSM is not turned on both in guest and host.
> > > Both guest and host have turned on transparent huge page.
> > > 
> > > The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
> > > which will cause
> > >     (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
> > >         mapped.
> > >     (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
> > >         which will fail
> > > 
> > > The guest then writes the GFN.
> > > This step will trigger (huge pmd split for huge page case) and .change_pte().
> > > 
> > > My guest is surely a sane guest. But currently I can't find out why
> > > certain pages are read before write.
> > > Will return back to you the reason after figuring it out after my long vacation.
> > Finally I figured out the reason.
> > 
> > Except 4 pages were read before written from vBIOS (I just want to skip finding
> > out why vBIOS does this), the remaining thousands of pages were read before
> > written from the guest Linux kernel.
> > 
> > If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or
> > "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or
> > "init_on_free=1", this read before written problem goes away.
> > 
> > However, turning on those configs has side effects as said in kernel config
> > message:
> > "all page allocator and slab allocator memory will be zeroed when allocated,
> > eliminating many kinds of "uninitialized heap memory" flaws, especially
> > heap content exposures. The performance impact varies by workload, but most
> > cases see <1% impact. Some synthetic workloads have measured as high as 7%."
> > 
> > If without the above two configs, or if with init_on_alloc=0 && init_on_free=0,
> > the root cause for all the reads of uninitialized heap memory are related to
> 
> Yeah, forcing the guest to pre-initialize all memory is a hack-a-fix and not a
> real solution.
> 
> > page cache pages of the guest virtual devices (specifically the virtual IDE
> > device in my case).
> 
> Why are you using IDE?  IDE is comically slow compared to VirtIO, and VirtIO has
> been broadly supported for something like 15 years, even on Windows.

I don't know why I'm using IDE.
Maybe just because it's of my default settings for years :)

And I sent this series was just because I think each guest write in this case has
to cause two EPT violations is wasted.

BTW, not only for IDE devices, I think any devices with DMA mask less than max GPA
width will cause the same problem. And also true when "swiotlb=force" is enabled.

> 
> > The reason for this unconditional read of page into bounce buffer
> > (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)")
> > is explained in the code:
> > 
> > /*
> >  * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
> >  * to the tlb buffer, if we knew for sure the device will
> >  * overwrite the entire current content. But we don't. Thus
> >  * unconditional bounce may prevent leaking swiotlb content (i.e.
> >  * kernel memory) to user-space.
> >  */
> > 
> > If we neglect this risk and do changes like
> > -       swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> > +       if (dir != DMA_FROM_DEVICE)
> > +               swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> > 
> > the issue of pages read before written from guest kernel just went away.
> > 
> > I don't think it's a swiotlb bug, because to prevent leaking swiotlb
> > content, if target page content is not copied firstly to the swiotlb's
> > bounce buffer, then the bounce buffer needs to be initialized to 0.
> > However, swiotlb_tbl_map_single() does not know whether the target page
> > is initialized or not. Then, it would cause page content to be trimmed
> > if device does not overwrite the entire memory.
> > 
> > > 
> > > > 
> > > > If there's something else going on, i.e. if your VM really is somehow generating
> > > > reads before writes, and if we really want to optimize use cases that can't use
> > > > hugepages for whatever reason, I would much prefer to do something like add a
> > > > memslot flag to state that the memslot should *always* be mapped writable.  Because
> > > Will check if this flag is necessary after figuring out the reason.
> > As explained above, I think it's a valid and non-rare practice in guest kernel to
> > cause read of uninitialized heap memory.
> 
> Heh, for some definitions of valid.  
> 
> > And the host admin may not know exactly when it's appropriate to apply the
> > memslot flag.
> 
> Yeah, a memslot flag is too fine-grained.
> 
> > Do you think it's good to make the "always write_fault = true" solution enabled
> > by default?
> 
> Sadly, probably not, because that would regress setups that do want to utilize
> CoW, e.g. I'm pretty sure requesting everything to be writable would be a big
> negative for KSM.
> 
> I do think we should add a KVM knob though.  Regardless of the validity or frequency
> of the guest behavior, and even though userspace can also workaround this by
> preallocating guest memory, I am struggling to think of any reason outside of KSM
> where CoW semantics are desirable.
> 
> Ooh, actually, maybe we could do
> 
> 	static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM);
> 
> and then cross our fingers that that doesn't regress some other funky setups.
Oh, this "always write_fault" solution may be not friendly to UFFD write protected pages too.

Sean Christopherson Sept. 6, 2023, 2:44 p.m. UTC | #12

On Wed, Sep 06, 2023, Robin Murphy wrote:
> On 2023-09-05 19:59, Sean Christopherson wrote:
> > And if the driver *doesn't* initialize the data, then the copy is at best pointless,
> > and possibly even worse than leaking stale swiotlb data.
> 
> Other than the overhead, done right it can't be any worse than if SWIOTLB
> were not involved at all.

Yep.

> > Looking at commit ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE"),
> > IIUC the data leak was observed with a synthetic test "driver" that was developed
> > to verify a real data leak fixed by commit a45b599ad808 ("scsi: sg: allocate with
> > __GFP_ZERO in sg_build_indirect()").  Which basically proves my point: copying
> > from the source only adds value absent a bug in the owning driver.
> 
> Huh? IIUC the bug there was that the SCSI layer failed to sanitise
> partially-written buffers. That bug was fixed, and the scrutiny therein
> happened to reveal that SWIOTLB *also* had a lower-level problem with
> partial writes, in that it was corrupting DMA-mapped memory which was not
> updated by the device. Partial DMA writes are not in themselves indicative
> of a bug, they may well be a valid and expected behaviour.

The problem is that the comment only talks about leaking data to userspace, and
doesn't say anything about data corruption or the "swiotlb needs to match hardware"
justification that Linus pointed out.  I buy both of those arguments for copying
data from the original page, but the "may prevent leaking swiotlb content" is IMO
completely nonsensical, because if preventing leakage is the only goal, then
explicitly initializing the memory is better in every way.

If no one objects, I'll put together a patch to rewrite the comment in terms of
mimicking hardware and not corrupting the caller's data.

> > IMO, rather than copying from the original memory, swiotlb_tbl_map_single() should
> > simply zero the original page(s) when establishing the mapping.  That would harden
> > all usage of swiotlb and avoid the read-before-write behavior that is problematic
> > for KVM.
> 
> Depends on one's exact definition of "harden"... Corrupting memory with
> zeros is less bad than corrupting memory with someone else's data if you
> look at it from an information security point of view, but from a
> not-corrupting-memory point of view it's definitely still corrupting memory
> :/
> 
> Taking a step back, is there not an argument that if people care about
> general KVM performance then they should maybe stop emulating obsolete PC
> hardware from 30 years ago, and at least emulate obsolete PC hardware from
> 20 years ago that supports 64-bit DMA?

Heh, I don't think there's an argument per se, people most definitely shouldn't
be emulating old hardware if they care about performance.  I already told Yan as
much.

> Even non-virtualised, SWIOTLB is pretty horrible for I/O performance by its
> very nature - avoiding it if at all possible should always be preferred.

Yeah.  The main reason I didn't just sweep this under the rug is the confidential
VM use case, where SWIOTLB is used to bounce data from guest private memory into
shread buffers.  There's also a good argument that anyone that cares about I/O
performance in confidential VMs should put in the effort to enlighten their device
drivers to use shared memory directly, but practically speaking that's easier said
than done.

Robin Murphy Sept. 6, 2023, 4:18 p.m. UTC | #13

On 2023-09-06 15:44, Sean Christopherson wrote:
> On Wed, Sep 06, 2023, Robin Murphy wrote:
>> On 2023-09-05 19:59, Sean Christopherson wrote:
>>> And if the driver *doesn't* initialize the data, then the copy is at best pointless,
>>> and possibly even worse than leaking stale swiotlb data.
>>
>> Other than the overhead, done right it can't be any worse than if SWIOTLB
>> were not involved at all.
> 
> Yep.
> 
>>> Looking at commit ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE"),
>>> IIUC the data leak was observed with a synthetic test "driver" that was developed
>>> to verify a real data leak fixed by commit a45b599ad808 ("scsi: sg: allocate with
>>> __GFP_ZERO in sg_build_indirect()").  Which basically proves my point: copying
>>> from the source only adds value absent a bug in the owning driver.
>>
>> Huh? IIUC the bug there was that the SCSI layer failed to sanitise
>> partially-written buffers. That bug was fixed, and the scrutiny therein
>> happened to reveal that SWIOTLB *also* had a lower-level problem with
>> partial writes, in that it was corrupting DMA-mapped memory which was not
>> updated by the device. Partial DMA writes are not in themselves indicative
>> of a bug, they may well be a valid and expected behaviour.
> 
> The problem is that the comment only talks about leaking data to userspace, and
> doesn't say anything about data corruption or the "swiotlb needs to match hardware"
> justification that Linus pointed out.  I buy both of those arguments for copying
> data from the original page, but the "may prevent leaking swiotlb content" is IMO
> completely nonsensical, because if preventing leakage is the only goal, then
> explicitly initializing the memory is better in every way.
> 
> If no one objects, I'll put together a patch to rewrite the comment in terms of
> mimicking hardware and not corrupting the caller's data.

Sounds good to me. I guess the trouble is that as soon as a CVE is 
involved it can then get hard to look past it, or want to risk appearing 
to downplay it :)

>>> IMO, rather than copying from the original memory, swiotlb_tbl_map_single() should
>>> simply zero the original page(s) when establishing the mapping.  That would harden
>>> all usage of swiotlb and avoid the read-before-write behavior that is problematic
>>> for KVM.
>>
>> Depends on one's exact definition of "harden"... Corrupting memory with
>> zeros is less bad than corrupting memory with someone else's data if you
>> look at it from an information security point of view, but from a
>> not-corrupting-memory point of view it's definitely still corrupting memory
>> :/
>>
>> Taking a step back, is there not an argument that if people care about
>> general KVM performance then they should maybe stop emulating obsolete PC
>> hardware from 30 years ago, and at least emulate obsolete PC hardware from
>> 20 years ago that supports 64-bit DMA?
> 
> Heh, I don't think there's an argument per se, people most definitely shouldn't
> be emulating old hardware if they care about performance.  I already told Yan as
> much.
> 
>> Even non-virtualised, SWIOTLB is pretty horrible for I/O performance by its
>> very nature - avoiding it if at all possible should always be preferred.
> 
> Yeah.  The main reason I didn't just sweep this under the rug is the confidential
> VM use case, where SWIOTLB is used to bounce data from guest private memory into
> shread buffers.  There's also a good argument that anyone that cares about I/O
> performance in confidential VMs should put in the effort to enlighten their device
> drivers to use shared memory directly, but practically speaking that's easier said
> than done.

Indeed a bunch of work has gone into SWIOTLB recently trying to make it 
a bit more efficient for such cases where it can't be avoided, so it is 
definitely still interesting to learn about impacts at other levels like 
this. Maybe there's a bit of a get-out for confidential VMs though, 
since presumably there's not much point COW-ing encrypted private 
memory, so perhaps KVM might end up wanting to optimise that out and 
thus happen to end up less sensitive to unavoidable SWIOTLB behaviour 
anyway?

Cheers,
Robin.

Sean Christopherson Sept. 6, 2023, 4:46 p.m. UTC | #14

On Wed, Sep 06, 2023, Robin Murphy wrote:
> On 2023-09-06 15:44, Sean Christopherson wrote:
> > On Wed, Sep 06, 2023, Robin Murphy wrote:
> > > Even non-virtualised, SWIOTLB is pretty horrible for I/O performance by its
> > > very nature - avoiding it if at all possible should always be preferred.
> > 
> > Yeah.  The main reason I didn't just sweep this under the rug is the confidential
> > VM use case, where SWIOTLB is used to bounce data from guest private memory into
> > shread buffers.  There's also a good argument that anyone that cares about I/O
> > performance in confidential VMs should put in the effort to enlighten their device
> > drivers to use shared memory directly, but practically speaking that's easier said
> > than done.
> 
> Indeed a bunch of work has gone into SWIOTLB recently trying to make it a
> bit more efficient for such cases where it can't be avoided, so it is
> definitely still interesting to learn about impacts at other levels like
> this. Maybe there's a bit of a get-out for confidential VMs though, since
> presumably there's not much point COW-ing encrypted private memory, so
> perhaps KVM might end up wanting to optimise that out and thus happen to end
> up less sensitive to unavoidable SWIOTLB behaviour anyway?

CoW should be a non-issue for confidential VMs, at least on x86.  SEV and SEV-ES
are effectively forced to pin memory as writable before it can be mapped into the
guest.  TDX and SNP and will have a different implementation, but similar behavior.

Confidential VMs would benefit purely by either eliminating or reducing the cost
of "initializing" memory, i.e. by eliminating the memcpy() or replacing it with a
memset().  I most definitely don't care enough about confidential VM I/O performance
to try and micro-optimize that behavior, their existence was simply what made me
look more closely instead of just telling Yan to stop using IDE :-)

Paolo Bonzini Sept. 6, 2023, 10:17 p.m. UTC | #15

On Tue, Sep 5, 2023 at 10:18 PM Sean Christopherson <seanjc@google.com> wrote:
> Ooh, actually, maybe we could do
>
>         static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM);
>
> and then cross our fingers that that doesn't regress some other funky setups.

It probably breaks gvisor-like setups that use MAP_PRIVATE mmap for
memslots? It would instantly break CoW even if memory is never
written.

Paolo

Yan Zhao Sept. 7, 2023, 12:36 a.m. UTC | #16

On Wed, Sep 06, 2023 at 05:51:36PM -0700, Sean Christopherson wrote:
> On Thu, Sep 07, 2023, Paolo Bonzini wrote:
> > On Tue, Sep 5, 2023 at 10:18 PM Sean Christopherson <seanjc@google.com> wrote:
> > > Ooh, actually, maybe we could do
> > >
> > >         static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM);
> > >
> > > and then cross our fingers that that doesn't regress some other funky setups.
> > 
> > It probably breaks gvisor-like setups that use MAP_PRIVATE mmap for
> > memslots? It would instantly break CoW even if memory is never
> > written.
> 
> Doh, I completely forgot about gvisor and the like.
> 
> Yan, I don't think this is worth pursuing.  My understanding is that only legacy,
> relatively slow devices need DMA32.  And as Robin pointed out, swiotlb=force isn't
> something that's likely deployed and certainly isn't intended for performance
> sensitive environments.

Yes. Then will you reconsider my patch 2?
I think it can remove the second EPT violation for each page write and
doesn't break KSM, and COW.

Sean Christopherson Sept. 7, 2023, 12:51 a.m. UTC | #17

On Thu, Sep 07, 2023, Paolo Bonzini wrote:
> On Tue, Sep 5, 2023 at 10:18 PM Sean Christopherson <seanjc@google.com> wrote:
> > Ooh, actually, maybe we could do
> >
> >         static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM);
> >
> > and then cross our fingers that that doesn't regress some other funky setups.
> 
> It probably breaks gvisor-like setups that use MAP_PRIVATE mmap for
> memslots? It would instantly break CoW even if memory is never
> written.

Doh, I completely forgot about gvisor and the like.

Yan, I don't think this is worth pursuing.  My understanding is that only legacy,
relatively slow devices need DMA32.  And as Robin pointed out, swiotlb=force isn't
something that's likely deployed and certainly isn't intended for performance
sensitive environments.

Christoph Hellwig Sept. 8, 2023, 8:18 a.m. UTC | #18

On Wed, Sep 06, 2023 at 05:18:51PM +0100, Robin Murphy wrote:
> Indeed a bunch of work has gone into SWIOTLB recently trying to make it a 
> bit more efficient for such cases where it can't be avoided, so it is 
> definitely still interesting to learn about impacts at other levels like 
> this. Maybe there's a bit of a get-out for confidential VMs though, since 
> presumably there's not much point COW-ing encrypted private memory, so 
> perhaps KVM might end up wanting to optimise that out and thus happen to 
> end up less sensitive to unavoidable SWIOTLB behaviour anyway?

Well, the fix for bounce buffering is to trust the device, and there is
a lot of work going into device authentication and attesttion right now
so that will happen.

On the swiotlb side a new version of the dma_sync_*_device APIs that
specifies the mapping len and the data length transfer would avoid
some of the overhead here.  We've decided that it is a good idea last
time, but so far no one has volunteers to implement it.

[0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU

Message

Comments