Message ID | 20230808085056.14644-1-yan.y.zhao@intel.com (mailing list archive) |
---|---|
Headers | show |
Series | KVM: x86/mmu: .change_pte() optimization in TDP MMU | expand |
On Tue, Aug 08, 2023, Yan Zhao wrote: > This series optmizes KVM mmu notifier.change_pte() handler in x86 TDP MMU > (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and prefetching > notified new PFN into SPTEs directly in the handler. > > As in [1], .change_pte() has been dead code on x86 for 10+ years. > Patch 1 drops the dead code in x86 TDP MMU to save cpu cycles and prepare > for optimization in TDP MMU in patch 2. If we're going to officially kill the long-dead attempt at optimizing KSM, I'd strongly prefer to rip out .change_pte() entirely, i.e. kill it off in all architectures and remove it from mmu_notifiers. The only reason I haven't proposed such patches is because I didn't want to it to backfire and lead to someone trying to resurrect the optimizations for KSM. > Patch 2 optimizes TDP MMU's .change_pte() handler to prefetch SPTEs in the > handler directly with PFN info contained in .change_pte() to avoid that > each vCPU write that triggers .change_pte() must undergo twice VMExits and > TDP page faults. IMO, prefaulting guest memory as writable is better handled by userspace, e.g. by using QEMU's prealloc option. It's more coarse grained, but at a minimum it's sufficient for improving guest boot time, e.g. by preallocating memory below 4GiB. And we can do even better, e.g. by providing a KVM ioctl() to allow userspace to prefault memory not just into the primary MMU, but also into KVM's MMU. Such an ioctl() is basically manadatory for TDX, we just need to morph the support being added by TDX into a generic ioctl()[*] Prefaulting guest memory as writable into the primary MMU should be able to achieve far better performance than hooking .change_pte(), as it will avoid the mmu_notifier invalidation, e.g. won't trigger taking mmu_lock for write and the resulting remote TLB flush(es). And a KVM ioctl() to prefault into KVM's MMU should eliminate page fault VM-Exits entirely. Explicit prefaulting isn't perfect, but IMO the value added by prefetching in .change_pte() isn't enough to justify carrying the hook and the code in KVM. [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com
On Wed, Aug 16, 2023 at 11:18:03AM -0700, Sean Christopherson wrote: > On Tue, Aug 08, 2023, Yan Zhao wrote: > > This series optmizes KVM mmu notifier.change_pte() handler in x86 TDP MMU > > (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and prefetching > > notified new PFN into SPTEs directly in the handler. > > > > As in [1], .change_pte() has been dead code on x86 for 10+ years. > > Patch 1 drops the dead code in x86 TDP MMU to save cpu cycles and prepare > > for optimization in TDP MMU in patch 2. > > If we're going to officially kill the long-dead attempt at optimizing KSM, I'd > strongly prefer to rip out .change_pte() entirely, i.e. kill it off in all > architectures and remove it from mmu_notifiers. The only reason I haven't proposed > such patches is because I didn't want to it to backfire and lead to someone trying > to resurrect the optimizations for KSM. > > > Patch 2 optimizes TDP MMU's .change_pte() handler to prefetch SPTEs in the > > handler directly with PFN info contained in .change_pte() to avoid that > > each vCPU write that triggers .change_pte() must undergo twice VMExits and > > TDP page faults. > > IMO, prefaulting guest memory as writable is better handled by userspace, e.g. by > using QEMU's prealloc option. It's more coarse grained, but at a minimum it's > sufficient for improving guest boot time, e.g. by preallocating memory below 4GiB. > > And we can do even better, e.g. by providing a KVM ioctl() to allow userspace to > prefault memory not just into the primary MMU, but also into KVM's MMU. Such an > ioctl() is basically manadatory for TDX, we just need to morph the support being > added by TDX into a generic ioctl()[*] > > Prefaulting guest memory as writable into the primary MMU should be able to achieve > far better performance than hooking .change_pte(), as it will avoid the mmu_notifier > invalidation, e.g. won't trigger taking mmu_lock for write and the resulting remote > TLB flush(es). And a KVM ioctl() to prefault into KVM's MMU should eliminate page > fault VM-Exits entirely. > > Explicit prefaulting isn't perfect, but IMO the value added by prefetching in > .change_pte() isn't enough to justify carrying the hook and the code in KVM. > > [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com Hi Sean, As I didn't write the full picture of patch 2 in the cover letter well, may I request you to take a look of patch 2 to see if you like it? (in case if you just read the cover letter). What I observed is that each vCPU write to a COW page in primary MMU will lead to twice TDP page faults. Then, I just update the secondary MMU during the first TDP page fault to avoid the second one. It's not a blind prefetch (I checked the vCPU to ensure it's triggered by a vCPU operation as much as possible) and it can benefit guests who doesn't explicitly request a prefault memory as write. Thanks Yan
On Thu, Aug 17, 2023, Yan Zhao wrote: > On Wed, Aug 16, 2023 at 11:18:03AM -0700, Sean Christopherson wrote: > > On Tue, Aug 08, 2023, Yan Zhao wrote: > > > This series optmizes KVM mmu notifier.change_pte() handler in x86 TDP MMU > > > (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and prefetching > > > notified new PFN into SPTEs directly in the handler. > > > > > > As in [1], .change_pte() has been dead code on x86 for 10+ years. > > > Patch 1 drops the dead code in x86 TDP MMU to save cpu cycles and prepare > > > for optimization in TDP MMU in patch 2. > > > > If we're going to officially kill the long-dead attempt at optimizing KSM, I'd > > strongly prefer to rip out .change_pte() entirely, i.e. kill it off in all > > architectures and remove it from mmu_notifiers. The only reason I haven't proposed > > such patches is because I didn't want to it to backfire and lead to someone trying > > to resurrect the optimizations for KSM. > > > > > Patch 2 optimizes TDP MMU's .change_pte() handler to prefetch SPTEs in the > > > handler directly with PFN info contained in .change_pte() to avoid that > > > each vCPU write that triggers .change_pte() must undergo twice VMExits and > > > TDP page faults. > > > > IMO, prefaulting guest memory as writable is better handled by userspace, e.g. by > > using QEMU's prealloc option. It's more coarse grained, but at a minimum it's > > sufficient for improving guest boot time, e.g. by preallocating memory below 4GiB. > > > > And we can do even better, e.g. by providing a KVM ioctl() to allow userspace to > > prefault memory not just into the primary MMU, but also into KVM's MMU. Such an > > ioctl() is basically manadatory for TDX, we just need to morph the support being > > added by TDX into a generic ioctl()[*] > > > > Prefaulting guest memory as writable into the primary MMU should be able to achieve > > far better performance than hooking .change_pte(), as it will avoid the mmu_notifier > > invalidation, e.g. won't trigger taking mmu_lock for write and the resulting remote > > TLB flush(es). And a KVM ioctl() to prefault into KVM's MMU should eliminate page > > fault VM-Exits entirely. > > > > Explicit prefaulting isn't perfect, but IMO the value added by prefetching in > > .change_pte() isn't enough to justify carrying the hook and the code in KVM. > > > > [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com > Hi Sean, > As I didn't write the full picture of patch 2 in the cover letter well, > may I request you to take a look of patch 2 to see if you like it? (in > case if you just read the cover letter). I read patch two, I replied to the cover letter as I wanted to discuss the two patches together since implementing the CoW optimization effectively means dropping the long-dead KSM optimization. > What I observed is that each vCPU write to a COW page in primary MMU > will lead to twice TDP page faults. > Then, I just update the secondary MMU during the first TDP page fault > to avoid the second one. > It's not a blind prefetch (I checked the vCPU to ensure it's triggered > by a vCPU operation as much as possible) Yes, that's part of the complexity I don't like. > and it can benefit guests who doesn't explicitly request a prefault memory as > write. Yes, I'm arguing that the benefit isn't significant, and that the use cases it might benefit aren't things people care about optimizing. I'm very skeptical that shaving those 8000 VM-Exits will translate to a meaningful reduction in guest boot time, let alone scale beyond very specific scenarios and configurations, which again, are likely suboptimal in nature. Actually, they most definitely are suboptimal, because the fact that this provides any benefit whatsoever means that either your VM isn't being backed with hugepages, or it's being backed with THP and transparent_hugepage/use_zero_page is enabled (and thus is generating CoW behavior). Enabling THP or using HugeTLB (which again can be done on a subset of guest memory) will have a far, far bigger impact on guest performance. Ditto for disabling using the huge zero_page when backing VMs with THP (any page touched by the guest is all but guaranteed to be written sooner than later, so using the zero_page doesn't make a whole lot of sense). E.g. a single CoW operation will take mmu_lock for write three times: invalidate_range_start(), change_pte(), and invalidate_range_end(), not to mention the THP zero_page CoW will first fault-in a read-only mapping, then split that mapping, and then do CoW on the 4KiB PTEs, which is *really* suboptimal. Actually, I don't even completely understand how you're seeing CoW behavior in the first place. No sane guest should blindly read (or execute) uninitialized memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off KSM, because turning on KSM is antithetical to guest performance (not to mention that KSM is wildly insecure for the guest, especially given the number of speculative execution attacks these days). If there's something else going on, i.e. if your VM really is somehow generating reads before writes, and if we really want to optimize use cases that can't use hugepages for whatever reason, I would much prefer to do something like add a memslot flag to state that the memslot should *always* be mapped writable. Because outside of setups that use KSM, the only reason I can think of to not map memory writable straightaway is if userspace somehow knows the guest isn't going to write that memory. If it weren't for KSM, and if it wouldn't potentially be a breaking change, I would even go so far as to say that KVM should always map writable memslots as writable in the guest. E.g. minus the uAPI, this is a lot simpler to implement and maintain. diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index dfbaafbe3a00..6c4640483881 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2727,10 +2727,14 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, return KVM_PFN_NOSLOT; } - /* Do not map writable pfn in the readonly memslot. */ - if (writable && memslot_is_readonly(slot)) { - *writable = false; - writable = NULL; + if (writable) { + if (memslot_is_readonly(slot)) { + *writable = false; + writable = NULL; + } else if (memslot_is_always_writable(slot)) { + *writable = true; + write_fault = true; + } } return hva_to_pfn(addr, atomic, interruptible, async, write_fault, And FWIW, removing .change_pte() entirely, even without any other optimizations, will also benefit those guests, as it will remove a source of mmu_lock contention along with all of the overhead of invoking callbacks, walking memslots, etc. And removing .change_pte() will benefit *all* guests by eliminating unrelated callbacks, i.e. callbacks when memory for the VMM takes a CoW fault. So yeah, unless I'm misunderstanding the bigger picture, the more I look at this, the more I'm against it.
On Thu, Aug 17, 2023 at 10:53:25AM -0700, Sean Christopherson wrote: ... > > What I observed is that each vCPU write to a COW page in primary MMU > > will lead to twice TDP page faults. > > Then, I just update the secondary MMU during the first TDP page fault > > to avoid the second one. > > It's not a blind prefetch (I checked the vCPU to ensure it's triggered > > by a vCPU operation as much as possible) > > Yes, that's part of the complexity I don't like. > > > and it can benefit guests who doesn't explicitly request a prefault memory as > > write. > > Yes, I'm arguing that the benefit isn't significant, and that the use cases it > might benefit aren't things people care about optimizing. > > I'm very skeptical that shaving those 8000 VM-Exits will translate to a meaningful > reduction in guest boot time, let alone scale beyond very specific scenarios and > configurations, which again, are likely suboptimal in nature. Actually, they most > definitely are suboptimal, because the fact that this provides any benefit > whatsoever means that either your VM isn't being backed with hugepages, or it's > being backed with THP and transparent_hugepage/use_zero_page is enabled (and thus > is generating CoW behavior). Yes, it's being backed with THP and transparent_hugepage/use_zero_page is enabled. > > Enabling THP or using HugeTLB (which again can be done on a subset of guest memory) > will have a far, far bigger impact on guest performance. Ditto for disabling > using the huge zero_page when backing VMs with THP (any page touched by the guest > is all but guaranteed to be written sooner than later, so using the zero_page > doesn't make a whole lot of sense). > > E.g. a single CoW operation will take mmu_lock for write three times: > invalidate_range_start(), change_pte(), and invalidate_range_end(), not to mention > the THP zero_page CoW will first fault-in a read-only mapping, then split that > mapping, and then do CoW on the 4KiB PTEs, which is *really* suboptimal. > > Actually, I don't even completely understand how you're seeing CoW behavior in > the first place. No sane guest should blindly read (or execute) uninitialized > memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has > been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off > KSM, because turning on KSM is antithetical to guest performance (not to mention > that KSM is wildly insecure for the guest, especially given the number of speculative > execution attacks these days). I'm running a linux guest. KSM is not turned on both in guest and host. Both guest and host have turned on transparent huge page. The guest first reads a GFN in a writable memslot (which is for "pc.ram"), which will cause (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn mapped. (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable, which will fail The guest then writes the GFN. This step will trigger (huge pmd split for huge page case) and .change_pte(). My guest is surely a sane guest. But currently I can't find out why certain pages are read before write. Will return back to you the reason after figuring it out after my long vacation. > > If there's something else going on, i.e. if your VM really is somehow generating > reads before writes, and if we really want to optimize use cases that can't use > hugepages for whatever reason, I would much prefer to do something like add a > memslot flag to state that the memslot should *always* be mapped writable. Because Will check if this flag is necessary after figuring out the reason. > outside of setups that use KSM, the only reason I can think of to not map memory > writable straightaway is if userspace somehow knows the guest isn't going to write > that memory. > > If it weren't for KSM, and if it wouldn't potentially be a breaking change, I > would even go so far as to say that KVM should always map writable memslots as > writable in the guest. > > E.g. minus the uAPI, this is a lot simpler to implement and maintain. > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index dfbaafbe3a00..6c4640483881 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2727,10 +2727,14 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, > return KVM_PFN_NOSLOT; > } > > - /* Do not map writable pfn in the readonly memslot. */ > - if (writable && memslot_is_readonly(slot)) { > - *writable = false; > - writable = NULL; > + if (writable) { > + if (memslot_is_readonly(slot)) { > + *writable = false; > + writable = NULL; > + } else if (memslot_is_always_writable(slot)) { > + *writable = true; > + write_fault = true; > + } > } > > return hva_to_pfn(addr, atomic, interruptible, async, write_fault, > > > And FWIW, removing .change_pte() entirely, even without any other optimizations, > will also benefit those guests, as it will remove a source of mmu_lock contention > along with all of the overhead of invoking callbacks, walking memslots, etc. And > removing .change_pte() will benefit *all* guests by eliminating unrelated callbacks, > i.e. callbacks when memory for the VMM takes a CoW fault. > If with above "always write_fault = true" solution, I think it's better. > So yeah, unless I'm misunderstanding the bigger picture, the more I look at this, > the more I'm against it.
On Fri, Aug 18, 2023, Yan Zhao wrote: > On Thu, Aug 17, 2023 at 10:53:25AM -0700, Sean Christopherson wrote: > > And FWIW, removing .change_pte() entirely, even without any other optimizations, > > will also benefit those guests, as it will remove a source of mmu_lock contention > > along with all of the overhead of invoking callbacks, walking memslots, etc. And > > removing .change_pte() will benefit *all* guests by eliminating unrelated callbacks, > > i.e. callbacks when memory for the VMM takes a CoW fault. > > > If with above "always write_fault = true" solution, I think it's better. Another option would be to allow a per-mm override of use_zero_page, but I think I like the KVM memslot route more as it provides better granularity, doesn't prevent CoW for VMM memory, and works even if THP isn't being used.
... > > Actually, I don't even completely understand how you're seeing CoW behavior in > > the first place. No sane guest should blindly read (or execute) uninitialized > > memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has > > been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off > > KSM, because turning on KSM is antithetical to guest performance (not to mention > > that KSM is wildly insecure for the guest, especially given the number of speculative > > execution attacks these days). > I'm running a linux guest. > KSM is not turned on both in guest and host. > Both guest and host have turned on transparent huge page. > > The guest first reads a GFN in a writable memslot (which is for "pc.ram"), > which will cause > (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn > mapped. > (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable, > which will fail > > The guest then writes the GFN. > This step will trigger (huge pmd split for huge page case) and .change_pte(). > > My guest is surely a sane guest. But currently I can't find out why > certain pages are read before write. > Will return back to you the reason after figuring it out after my long vacation. Finally I figured out the reason. Except 4 pages were read before written from vBIOS (I just want to skip finding out why vBIOS does this), the remaining thousands of pages were read before written from the guest Linux kernel. If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or "init_on_free=1", this read before written problem goes away. However, turning on those configs has side effects as said in kernel config message: "all page allocator and slab allocator memory will be zeroed when allocated, eliminating many kinds of "uninitialized heap memory" flaws, especially heap content exposures. The performance impact varies by workload, but most cases see <1% impact. Some synthetic workloads have measured as high as 7%." If without the above two configs, or if with init_on_alloc=0 && init_on_free=0, the root cause for all the reads of uninitialized heap memory are related to page cache pages of the guest virtual devices (specifically the virtual IDE device in my case). When the guest kernel triggers a guest block device read ahead, pages are allocated as page cache pages, and requests to read disk data into the page cache are issued. The disk data read requests will cause dma_direct_map_page() called if vIOMMU is not enabled. Then, because the virtual IDE device can only direct access 32-bit DMA address (equal to GPA) at maximum, swiotlb will be used as DMA bounce if page cache pages are with GPA > 32 bits. Then the call path is dma_direct_map_page() --> swiotlb_map() -->swiotlb_tbl_map_single() In swiotlb_tbl_map_single(), though DMA direction is DMA_FROM_DEVICE, this swiotlb_tbl_map_single() will call swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE) to read page cache content to the bounce buffer first. Then, during device DMAs, device content is DMAed into the bounce buffer. After that, the bounce buffer data will be copied back to the page cache page after calling dma_direct_unmap_page() --> swiotlb_tbl_unmap_single(). IOW, before reading ahead device data into the page cache, the page cache is read and copied to the bounce buffer, though an immediate write is followed to copy bounce buffer data back to the page cache. This explains why it's observed in host that most pages are written immediately after it's read, and .change_pte() occurs heavily during guest boot-up and nested guest boot-up, -- when disk readahead happens abundantly. The reason for this unconditional read of page into bounce buffer (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)") is explained in the code: /* * When dir == DMA_FROM_DEVICE we could omit the copy from the orig * to the tlb buffer, if we knew for sure the device will * overwrite the entire current content. But we don't. Thus * unconditional bounce may prevent leaking swiotlb content (i.e. * kernel memory) to user-space. */ If we neglect this risk and do changes like - swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); + if (dir != DMA_FROM_DEVICE) + swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); the issue of pages read before written from guest kernel just went away. I don't think it's a swiotlb bug, because to prevent leaking swiotlb content, if target page content is not copied firstly to the swiotlb's bounce buffer, then the bounce buffer needs to be initialized to 0. However, swiotlb_tbl_map_single() does not know whether the target page is initialized or not. Then, it would cause page content to be trimmed if device does not overwrite the entire memory. > > > > > If there's something else going on, i.e. if your VM really is somehow generating > > reads before writes, and if we really want to optimize use cases that can't use > > hugepages for whatever reason, I would much prefer to do something like add a > > memslot flag to state that the memslot should *always* be mapped writable. Because > Will check if this flag is necessary after figuring out the reason. As explained above, I think it's a valid and non-rare practice in guest kernel to cause read of uninitialized heap memory. And the host admin may not know exactly when it's appropriate to apply the memslot flag. Do you think it's good to make the "always write_fault = true" solution enabled by default?
+swiotbl maintainers and Linus Spinning off a discussion about swiotlb behavior to its own thread. Quick background: when running Linux as a KVM guest, Yan observed memory accesses where Linux is reading completely uninitialized memory (never been written by the guest) and traced it back to this code in the swiotlb: /* * When dir == DMA_FROM_DEVICE we could omit the copy from the orig * to the tlb buffer, if we knew for sure the device will * overwrite the entire current content. But we don't. Thus * unconditional bounce may prevent leaking swiotlb content (i.e. * kernel memory) to user-space. */ swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); The read-before-write behavior results in suboptimal performance as KVM maps the memory as read-only, and then triggers CoW when the guest inevitably writes the memory (CoW is significantly more expensive when KVM is involved). There are a variety of ways to workaround this in KVM, but even if we decide to address this in KVM, the swiotlb behavior is sketchy. Not to mention that any KVM changes are highly unlikely to be backported to LTS kernels. On Mon, Sep 04, 2023, Yan Zhao wrote: > ... > > > Actually, I don't even completely understand how you're seeing CoW behavior in > > > the first place. No sane guest should blindly read (or execute) uninitialized > > > memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK > > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has > > > been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off > > > KSM, because turning on KSM is antithetical to guest performance (not to mention > > > that KSM is wildly insecure for the guest, especially given the number of speculative > > > execution attacks these days). > > I'm running a linux guest. > > KSM is not turned on both in guest and host. > > Both guest and host have turned on transparent huge page. > > > > The guest first reads a GFN in a writable memslot (which is for "pc.ram"), > > which will cause > > (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn > > mapped. > > (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable, > > which will fail > > > > The guest then writes the GFN. > > This step will trigger (huge pmd split for huge page case) and .change_pte(). > > > > My guest is surely a sane guest. But currently I can't find out why > > certain pages are read before write. > > Will return back to you the reason after figuring it out after my long vacation. > Finally I figured out the reason. > > Except 4 pages were read before written from vBIOS (I just want to skip finding > out why vBIOS does this), the remaining thousands of pages were read before > written from the guest Linux kernel. ... > When the guest kernel triggers a guest block device read ahead, pages are > allocated as page cache pages, and requests to read disk data into the page > cache are issued. > > The disk data read requests will cause dma_direct_map_page() called if vIOMMU > is not enabled. Then, because the virtual IDE device can only direct access > 32-bit DMA address (equal to GPA) at maximum, swiotlb will be used as DMA > bounce if page cache pages are with GPA > 32 bits. > > Then the call path is > dma_direct_map_page() --> swiotlb_map() -->swiotlb_tbl_map_single() > > In swiotlb_tbl_map_single(), though DMA direction is DMA_FROM_DEVICE, > this swiotlb_tbl_map_single() will call > swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE) to read page cache > content to the bounce buffer first. > Then, during device DMAs, device content is DMAed into the bounce buffer. > After that, the bounce buffer data will be copied back to the page cache page > after calling dma_direct_unmap_page() --> swiotlb_tbl_unmap_single(). > > > IOW, before reading ahead device data into the page cache, the page cache is > read and copied to the bounce buffer, though an immediate write is followed to > copy bounce buffer data back to the page cache. > > This explains why it's observed in host that most pages are written immediately > after it's read, and .change_pte() occurs heavily during guest boot-up and > nested guest boot-up, -- when disk readahead happens abundantly. > > The reason for this unconditional read of page into bounce buffer > (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)") > is explained in the code: > > /* > * When dir == DMA_FROM_DEVICE we could omit the copy from the orig > * to the tlb buffer, if we knew for sure the device will > * overwrite the entire current content. But we don't. Thus > * unconditional bounce may prevent leaking swiotlb content (i.e. > * kernel memory) to user-space. > */ > > If we neglect this risk and do changes like > - swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); > + if (dir != DMA_FROM_DEVICE) > + swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); > > the issue of pages read before written from guest kernel just went away. > > I don't think it's a swiotlb bug, because to prevent leaking swiotlb > content, if target page content is not copied firstly to the swiotlb's > bounce buffer, then the bounce buffer needs to be initialized to 0. > However, swiotlb_tbl_map_single() does not know whether the target page > is initialized or not. Then, it would cause page content to be trimmed > if device does not overwrite the entire memory. The math doesn't add up though. Observing a read-before-write means the intended goal of preventing data leaks to userspace is not being met. I.e. instead of leaking old swiotlb, the kernel is (theoretically) leaking whatever data is in the original page (page cache in your case). That data *may* be completely uninitialized, especially during boot, but the original pages may also contain data from whatever was using the pages before they were allocated by the driver. It's possible the read of uninitialized data is observed only when either the driver that triggered the mapping _knows_ that the device will overwrite the entire mapping, or the driver will consume only the written parts. But in those cases, copying from the original memory is completely pointless. If neither of the above holds true, then copying from the original adds value only if preserving the data is necessary for functional correctness, or the driver explicitly initialized the original memory. Given that preserving source data was recently added, I highly doubt it's necessary for functional correctness. And if the driver *doesn't* initialize the data, then the copy is at best pointless, and possibly even worse than leaking stale swiotlb data. Looking at commit ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE"), IIUC the data leak was observed with a synthetic test "driver" that was developed to verify a real data leak fixed by commit a45b599ad808 ("scsi: sg: allocate with __GFP_ZERO in sg_build_indirect()"). Which basically proves my point: copying from the source only adds value absent a bug in the owning driver. IMO, rather than copying from the original memory, swiotlb_tbl_map_single() should simply zero the original page(s) when establishing the mapping. That would harden all usage of swiotlb and avoid the read-before-write behavior that is problematic for KVM.
On Tue, 5 Sept 2023 at 11:59, Sean Christopherson <seanjc@google.com> wrote: > > IMO, rather than copying from the original memory, swiotlb_tbl_map_single() should > simply zero the original page(s) when establishing the mapping. That would harden > all usage of swiotlb and avoid the read-before-write behavior that is problematic > for KVM. I don't disagree, but the argument at the time (I think from Christoph, but I might be barking entirely up the wrong tree) was that the swiotlb behavior should match hardware DMA, and we had known cases where the hardware only did partial IO. Honestly, even pre-zeroing sounds silly (and *much* too expensive, even if less so than copying), and we'd be better off if we only zero the rest once we have seen a partial DMA result, but I don't actually know if we have that partial size knowledge. Linus
On Mon, Sep 04, 2023, Yan Zhao wrote: > ... > > > Actually, I don't even completely understand how you're seeing CoW behavior in > > > the first place. No sane guest should blindly read (or execute) uninitialized > > > memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK > > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has > > > been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off > > > KSM, because turning on KSM is antithetical to guest performance (not to mention > > > that KSM is wildly insecure for the guest, especially given the number of speculative > > > execution attacks these days). > > I'm running a linux guest. > > KSM is not turned on both in guest and host. > > Both guest and host have turned on transparent huge page. > > > > The guest first reads a GFN in a writable memslot (which is for "pc.ram"), > > which will cause > > (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn > > mapped. > > (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable, > > which will fail > > > > The guest then writes the GFN. > > This step will trigger (huge pmd split for huge page case) and .change_pte(). > > > > My guest is surely a sane guest. But currently I can't find out why > > certain pages are read before write. > > Will return back to you the reason after figuring it out after my long vacation. > Finally I figured out the reason. > > Except 4 pages were read before written from vBIOS (I just want to skip finding > out why vBIOS does this), the remaining thousands of pages were read before > written from the guest Linux kernel. > > If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or > "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or > "init_on_free=1", this read before written problem goes away. > > However, turning on those configs has side effects as said in kernel config > message: > "all page allocator and slab allocator memory will be zeroed when allocated, > eliminating many kinds of "uninitialized heap memory" flaws, especially > heap content exposures. The performance impact varies by workload, but most > cases see <1% impact. Some synthetic workloads have measured as high as 7%." > > If without the above two configs, or if with init_on_alloc=0 && init_on_free=0, > the root cause for all the reads of uninitialized heap memory are related to Yeah, forcing the guest to pre-initialize all memory is a hack-a-fix and not a real solution. > page cache pages of the guest virtual devices (specifically the virtual IDE > device in my case). Why are you using IDE? IDE is comically slow compared to VirtIO, and VirtIO has been broadly supported for something like 15 years, even on Windows. > The reason for this unconditional read of page into bounce buffer > (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)") > is explained in the code: > > /* > * When dir == DMA_FROM_DEVICE we could omit the copy from the orig > * to the tlb buffer, if we knew for sure the device will > * overwrite the entire current content. But we don't. Thus > * unconditional bounce may prevent leaking swiotlb content (i.e. > * kernel memory) to user-space. > */ > > If we neglect this risk and do changes like > - swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); > + if (dir != DMA_FROM_DEVICE) > + swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); > > the issue of pages read before written from guest kernel just went away. > > I don't think it's a swiotlb bug, because to prevent leaking swiotlb > content, if target page content is not copied firstly to the swiotlb's > bounce buffer, then the bounce buffer needs to be initialized to 0. > However, swiotlb_tbl_map_single() does not know whether the target page > is initialized or not. Then, it would cause page content to be trimmed > if device does not overwrite the entire memory. > > > > > > > > > If there's something else going on, i.e. if your VM really is somehow generating > > > reads before writes, and if we really want to optimize use cases that can't use > > > hugepages for whatever reason, I would much prefer to do something like add a > > > memslot flag to state that the memslot should *always* be mapped writable. Because > > Will check if this flag is necessary after figuring out the reason. > As explained above, I think it's a valid and non-rare practice in guest kernel to > cause read of uninitialized heap memory. Heh, for some definitions of valid. > And the host admin may not know exactly when it's appropriate to apply the > memslot flag. Yeah, a memslot flag is too fine-grained. > Do you think it's good to make the "always write_fault = true" solution enabled > by default? Sadly, probably not, because that would regress setups that do want to utilize CoW, e.g. I'm pretty sure requesting everything to be writable would be a big negative for KSM. I do think we should add a KVM knob though. Regardless of the validity or frequency of the guest behavior, and even though userspace can also workaround this by preallocating guest memory, I am struggling to think of any reason outside of KSM where CoW semantics are desirable. Ooh, actually, maybe we could do static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM); and then cross our fingers that that doesn't regress some other funky setups.
On 2023-09-05 19:59, Sean Christopherson wrote: > +swiotbl maintainers and Linus > > Spinning off a discussion about swiotlb behavior to its own thread. > > Quick background: when running Linux as a KVM guest, Yan observed memory accesses > where Linux is reading completely uninitialized memory (never been written by the > guest) and traced it back to this code in the swiotlb: > > /* > * When dir == DMA_FROM_DEVICE we could omit the copy from the orig > * to the tlb buffer, if we knew for sure the device will > * overwrite the entire current content. But we don't. Thus > * unconditional bounce may prevent leaking swiotlb content (i.e. > * kernel memory) to user-space. > */ > swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); > > The read-before-write behavior results in suboptimal performance as KVM maps the > memory as read-only, and then triggers CoW when the guest inevitably writes the > memory (CoW is significantly more expensive when KVM is involved). > > There are a variety of ways to workaround this in KVM, but even if we decide to > address this in KVM, the swiotlb behavior is sketchy. Not to mention that any > KVM changes are highly unlikely to be backported to LTS kernels. > > On Mon, Sep 04, 2023, Yan Zhao wrote: >> ... >>>> Actually, I don't even completely understand how you're seeing CoW behavior in >>>> the first place. No sane guest should blindly read (or execute) uninitialized >>>> memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK >>>> QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has >>>> been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off >>>> KSM, because turning on KSM is antithetical to guest performance (not to mention >>>> that KSM is wildly insecure for the guest, especially given the number of speculative >>>> execution attacks these days). >>> I'm running a linux guest. >>> KSM is not turned on both in guest and host. >>> Both guest and host have turned on transparent huge page. >>> >>> The guest first reads a GFN in a writable memslot (which is for "pc.ram"), >>> which will cause >>> (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn >>> mapped. >>> (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable, >>> which will fail >>> >>> The guest then writes the GFN. >>> This step will trigger (huge pmd split for huge page case) and .change_pte(). >>> >>> My guest is surely a sane guest. But currently I can't find out why >>> certain pages are read before write. >>> Will return back to you the reason after figuring it out after my long vacation. >> Finally I figured out the reason. >> >> Except 4 pages were read before written from vBIOS (I just want to skip finding >> out why vBIOS does this), the remaining thousands of pages were read before >> written from the guest Linux kernel. > > ... > >> When the guest kernel triggers a guest block device read ahead, pages are >> allocated as page cache pages, and requests to read disk data into the page >> cache are issued. >> >> The disk data read requests will cause dma_direct_map_page() called if vIOMMU >> is not enabled. Then, because the virtual IDE device can only direct access >> 32-bit DMA address (equal to GPA) at maximum, swiotlb will be used as DMA >> bounce if page cache pages are with GPA > 32 bits. >> >> Then the call path is >> dma_direct_map_page() --> swiotlb_map() -->swiotlb_tbl_map_single() >> >> In swiotlb_tbl_map_single(), though DMA direction is DMA_FROM_DEVICE, >> this swiotlb_tbl_map_single() will call >> swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE) to read page cache >> content to the bounce buffer first. >> Then, during device DMAs, device content is DMAed into the bounce buffer. >> After that, the bounce buffer data will be copied back to the page cache page >> after calling dma_direct_unmap_page() --> swiotlb_tbl_unmap_single(). >> >> >> IOW, before reading ahead device data into the page cache, the page cache is >> read and copied to the bounce buffer, though an immediate write is followed to >> copy bounce buffer data back to the page cache. >> >> This explains why it's observed in host that most pages are written immediately >> after it's read, and .change_pte() occurs heavily during guest boot-up and >> nested guest boot-up, -- when disk readahead happens abundantly. >> >> The reason for this unconditional read of page into bounce buffer >> (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)") >> is explained in the code: >> >> /* >> * When dir == DMA_FROM_DEVICE we could omit the copy from the orig >> * to the tlb buffer, if we knew for sure the device will >> * overwrite the entire current content. But we don't. Thus >> * unconditional bounce may prevent leaking swiotlb content (i.e. >> * kernel memory) to user-space. >> */ >> >> If we neglect this risk and do changes like >> - swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); >> + if (dir != DMA_FROM_DEVICE) >> + swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); >> >> the issue of pages read before written from guest kernel just went away. >> >> I don't think it's a swiotlb bug, because to prevent leaking swiotlb >> content, if target page content is not copied firstly to the swiotlb's >> bounce buffer, then the bounce buffer needs to be initialized to 0. >> However, swiotlb_tbl_map_single() does not know whether the target page >> is initialized or not. Then, it would cause page content to be trimmed >> if device does not overwrite the entire memory. > > The math doesn't add up though. Observing a read-before-write means the intended > goal of preventing data leaks to userspace is not being met. I.e. instead of > leaking old swiotlb, the kernel is (theoretically) leaking whatever data is in the > original page (page cache in your case). Sure, it copies the destination page into the SWIOTLB buffer on map, then copies it back out again on unmap, so it's only "leaking" the previous contents of the page into the same page. Think of it as SWIOTLB doing a read-modify-write of the DMA-mapped region, because that's exactly what it's doing (on the basis that it can't be sure of overwriting it entirely). You can then consider it the same as if the device DMAs to the page directly without SWIOTLB being involved - if it does a partial write and that lets previous contents be mapped to userspace which shouldn't have been, that can only be the fault of whoever failed to sanitise the page beforehand. > That data *may* be completely uninitialized, especially during boot, but the > original pages may also contain data from whatever was using the pages before they > were allocated by the driver. > > It's possible the read of uninitialized data is observed only when either the > driver that triggered the mapping _knows_ that the device will overwrite the entire > mapping, or the driver will consume only the written parts. But in those cases, > copying from the original memory is completely pointless. Indeed, but sadly it is impossible for SWIOTLB to detect when that is or isn't the case. No matter what the SWIOTLB buffer is initialised with, there is always the possibility that the device coincidentally writes the same pattern of bytes, thus it cannot ever know for sure what was or wasn't touched between map and unmap. This is why in the original discussion we also tossed around the idea of a DMA attribute for the caller to indicate "I definitely expect the device to overwrite this entire mapping", but then we went round in circles a bit more, concluded that it might be hard to get right, and shelved it, but then the wrong version of the patch got merged which did still include the half-formed attribute idea, and then there was the whole other trainwreck of reverting and unreverting, and I have a feeling that what we ended up with still wasn't quite right but by that point I was so fed up of the whole business I had stopped caring. > If neither of the above holds true, then copying from the original adds value only > if preserving the data is necessary for functional correctness, or the driver > explicitly initialized the original memory. Given that preserving source data was > recently added, I highly doubt it's necessary for functional correctness. Seems kinda hard to say that with certainty - I would imagine that the amount of actual testing done with "swiotlb=force" is minimal at best. There are so many 64-bit-capable devices in the world which would never normally interact with SWIOTLB, but are liable to find themselves forced into doing so in future confidential compute etc. scenarios. I don't think I'd bet anything of value that *nothing* anywhere is depending on partial DMA writes being handled correctly. > And if the driver *doesn't* initialize the data, then the copy is at best pointless, > and possibly even worse than leaking stale swiotlb data. Other than the overhead, done right it can't be any worse than if SWIOTLB were not involved at all. > Looking at commit ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE"), > IIUC the data leak was observed with a synthetic test "driver" that was developed > to verify a real data leak fixed by commit a45b599ad808 ("scsi: sg: allocate with > __GFP_ZERO in sg_build_indirect()"). Which basically proves my point: copying > from the source only adds value absent a bug in the owning driver. Huh? IIUC the bug there was that the SCSI layer failed to sanitise partially-written buffers. That bug was fixed, and the scrutiny therein happened to reveal that SWIOTLB *also* had a lower-level problem with partial writes, in that it was corrupting DMA-mapped memory which was not updated by the device. Partial DMA writes are not in themselves indicative of a bug, they may well be a valid and expected behaviour. > IMO, rather than copying from the original memory, swiotlb_tbl_map_single() should > simply zero the original page(s) when establishing the mapping. That would harden > all usage of swiotlb and avoid the read-before-write behavior that is problematic > for KVM. Depends on one's exact definition of "harden"... Corrupting memory with zeros is less bad than corrupting memory with someone else's data if you look at it from an information security point of view, but from a not-corrupting-memory point of view it's definitely still corrupting memory :/ Taking a step back, is there not an argument that if people care about general KVM performance then they should maybe stop emulating obsolete PC hardware from 30 years ago, and at least emulate obsolete PC hardware from 20 years ago that supports 64-bit DMA? Even non-virtualised, SWIOTLB is pretty horrible for I/O performance by its very nature - avoiding it if at all possible should always be preferred. Thanks, Robin.
On Tue, Sep 05, 2023 at 01:18:23PM -0700, Sean Christopherson wrote: > On Mon, Sep 04, 2023, Yan Zhao wrote: > > ... > > > > Actually, I don't even completely understand how you're seeing CoW behavior in > > > > the first place. No sane guest should blindly read (or execute) uninitialized > > > > memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK > > > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has > > > > been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off > > > > KSM, because turning on KSM is antithetical to guest performance (not to mention > > > > that KSM is wildly insecure for the guest, especially given the number of speculative > > > > execution attacks these days). > > > I'm running a linux guest. > > > KSM is not turned on both in guest and host. > > > Both guest and host have turned on transparent huge page. > > > > > > The guest first reads a GFN in a writable memslot (which is for "pc.ram"), > > > which will cause > > > (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn > > > mapped. > > > (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable, > > > which will fail > > > > > > The guest then writes the GFN. > > > This step will trigger (huge pmd split for huge page case) and .change_pte(). > > > > > > My guest is surely a sane guest. But currently I can't find out why > > > certain pages are read before write. > > > Will return back to you the reason after figuring it out after my long vacation. > > Finally I figured out the reason. > > > > Except 4 pages were read before written from vBIOS (I just want to skip finding > > out why vBIOS does this), the remaining thousands of pages were read before > > written from the guest Linux kernel. > > > > If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or > > "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or > > "init_on_free=1", this read before written problem goes away. > > > > However, turning on those configs has side effects as said in kernel config > > message: > > "all page allocator and slab allocator memory will be zeroed when allocated, > > eliminating many kinds of "uninitialized heap memory" flaws, especially > > heap content exposures. The performance impact varies by workload, but most > > cases see <1% impact. Some synthetic workloads have measured as high as 7%." > > > > If without the above two configs, or if with init_on_alloc=0 && init_on_free=0, > > the root cause for all the reads of uninitialized heap memory are related to > > Yeah, forcing the guest to pre-initialize all memory is a hack-a-fix and not a > real solution. > > > page cache pages of the guest virtual devices (specifically the virtual IDE > > device in my case). > > Why are you using IDE? IDE is comically slow compared to VirtIO, and VirtIO has > been broadly supported for something like 15 years, even on Windows. I don't know why I'm using IDE. Maybe just because it's of my default settings for years :) And I sent this series was just because I think each guest write in this case has to cause two EPT violations is wasted. BTW, not only for IDE devices, I think any devices with DMA mask less than max GPA width will cause the same problem. And also true when "swiotlb=force" is enabled. > > > The reason for this unconditional read of page into bounce buffer > > (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)") > > is explained in the code: > > > > /* > > * When dir == DMA_FROM_DEVICE we could omit the copy from the orig > > * to the tlb buffer, if we knew for sure the device will > > * overwrite the entire current content. But we don't. Thus > > * unconditional bounce may prevent leaking swiotlb content (i.e. > > * kernel memory) to user-space. > > */ > > > > If we neglect this risk and do changes like > > - swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); > > + if (dir != DMA_FROM_DEVICE) > > + swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); > > > > the issue of pages read before written from guest kernel just went away. > > > > I don't think it's a swiotlb bug, because to prevent leaking swiotlb > > content, if target page content is not copied firstly to the swiotlb's > > bounce buffer, then the bounce buffer needs to be initialized to 0. > > However, swiotlb_tbl_map_single() does not know whether the target page > > is initialized or not. Then, it would cause page content to be trimmed > > if device does not overwrite the entire memory. > > > > > > > > > > > > > If there's something else going on, i.e. if your VM really is somehow generating > > > > reads before writes, and if we really want to optimize use cases that can't use > > > > hugepages for whatever reason, I would much prefer to do something like add a > > > > memslot flag to state that the memslot should *always* be mapped writable. Because > > > Will check if this flag is necessary after figuring out the reason. > > As explained above, I think it's a valid and non-rare practice in guest kernel to > > cause read of uninitialized heap memory. > > Heh, for some definitions of valid. > > > And the host admin may not know exactly when it's appropriate to apply the > > memslot flag. > > Yeah, a memslot flag is too fine-grained. > > > Do you think it's good to make the "always write_fault = true" solution enabled > > by default? > > Sadly, probably not, because that would regress setups that do want to utilize > CoW, e.g. I'm pretty sure requesting everything to be writable would be a big > negative for KSM. > > I do think we should add a KVM knob though. Regardless of the validity or frequency > of the guest behavior, and even though userspace can also workaround this by > preallocating guest memory, I am struggling to think of any reason outside of KSM > where CoW semantics are desirable. > > Ooh, actually, maybe we could do > > static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM); > > and then cross our fingers that that doesn't regress some other funky setups. Oh, this "always write_fault" solution may be not friendly to UFFD write protected pages too.
On Wed, Sep 06, 2023, Robin Murphy wrote: > On 2023-09-05 19:59, Sean Christopherson wrote: > > And if the driver *doesn't* initialize the data, then the copy is at best pointless, > > and possibly even worse than leaking stale swiotlb data. > > Other than the overhead, done right it can't be any worse than if SWIOTLB > were not involved at all. Yep. > > Looking at commit ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE"), > > IIUC the data leak was observed with a synthetic test "driver" that was developed > > to verify a real data leak fixed by commit a45b599ad808 ("scsi: sg: allocate with > > __GFP_ZERO in sg_build_indirect()"). Which basically proves my point: copying > > from the source only adds value absent a bug in the owning driver. > > Huh? IIUC the bug there was that the SCSI layer failed to sanitise > partially-written buffers. That bug was fixed, and the scrutiny therein > happened to reveal that SWIOTLB *also* had a lower-level problem with > partial writes, in that it was corrupting DMA-mapped memory which was not > updated by the device. Partial DMA writes are not in themselves indicative > of a bug, they may well be a valid and expected behaviour. The problem is that the comment only talks about leaking data to userspace, and doesn't say anything about data corruption or the "swiotlb needs to match hardware" justification that Linus pointed out. I buy both of those arguments for copying data from the original page, but the "may prevent leaking swiotlb content" is IMO completely nonsensical, because if preventing leakage is the only goal, then explicitly initializing the memory is better in every way. If no one objects, I'll put together a patch to rewrite the comment in terms of mimicking hardware and not corrupting the caller's data. > > IMO, rather than copying from the original memory, swiotlb_tbl_map_single() should > > simply zero the original page(s) when establishing the mapping. That would harden > > all usage of swiotlb and avoid the read-before-write behavior that is problematic > > for KVM. > > Depends on one's exact definition of "harden"... Corrupting memory with > zeros is less bad than corrupting memory with someone else's data if you > look at it from an information security point of view, but from a > not-corrupting-memory point of view it's definitely still corrupting memory > :/ > > Taking a step back, is there not an argument that if people care about > general KVM performance then they should maybe stop emulating obsolete PC > hardware from 30 years ago, and at least emulate obsolete PC hardware from > 20 years ago that supports 64-bit DMA? Heh, I don't think there's an argument per se, people most definitely shouldn't be emulating old hardware if they care about performance. I already told Yan as much. > Even non-virtualised, SWIOTLB is pretty horrible for I/O performance by its > very nature - avoiding it if at all possible should always be preferred. Yeah. The main reason I didn't just sweep this under the rug is the confidential VM use case, where SWIOTLB is used to bounce data from guest private memory into shread buffers. There's also a good argument that anyone that cares about I/O performance in confidential VMs should put in the effort to enlighten their device drivers to use shared memory directly, but practically speaking that's easier said than done.
On 2023-09-06 15:44, Sean Christopherson wrote: > On Wed, Sep 06, 2023, Robin Murphy wrote: >> On 2023-09-05 19:59, Sean Christopherson wrote: >>> And if the driver *doesn't* initialize the data, then the copy is at best pointless, >>> and possibly even worse than leaking stale swiotlb data. >> >> Other than the overhead, done right it can't be any worse than if SWIOTLB >> were not involved at all. > > Yep. > >>> Looking at commit ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE"), >>> IIUC the data leak was observed with a synthetic test "driver" that was developed >>> to verify a real data leak fixed by commit a45b599ad808 ("scsi: sg: allocate with >>> __GFP_ZERO in sg_build_indirect()"). Which basically proves my point: copying >>> from the source only adds value absent a bug in the owning driver. >> >> Huh? IIUC the bug there was that the SCSI layer failed to sanitise >> partially-written buffers. That bug was fixed, and the scrutiny therein >> happened to reveal that SWIOTLB *also* had a lower-level problem with >> partial writes, in that it was corrupting DMA-mapped memory which was not >> updated by the device. Partial DMA writes are not in themselves indicative >> of a bug, they may well be a valid and expected behaviour. > > The problem is that the comment only talks about leaking data to userspace, and > doesn't say anything about data corruption or the "swiotlb needs to match hardware" > justification that Linus pointed out. I buy both of those arguments for copying > data from the original page, but the "may prevent leaking swiotlb content" is IMO > completely nonsensical, because if preventing leakage is the only goal, then > explicitly initializing the memory is better in every way. > > If no one objects, I'll put together a patch to rewrite the comment in terms of > mimicking hardware and not corrupting the caller's data. Sounds good to me. I guess the trouble is that as soon as a CVE is involved it can then get hard to look past it, or want to risk appearing to downplay it :) >>> IMO, rather than copying from the original memory, swiotlb_tbl_map_single() should >>> simply zero the original page(s) when establishing the mapping. That would harden >>> all usage of swiotlb and avoid the read-before-write behavior that is problematic >>> for KVM. >> >> Depends on one's exact definition of "harden"... Corrupting memory with >> zeros is less bad than corrupting memory with someone else's data if you >> look at it from an information security point of view, but from a >> not-corrupting-memory point of view it's definitely still corrupting memory >> :/ >> >> Taking a step back, is there not an argument that if people care about >> general KVM performance then they should maybe stop emulating obsolete PC >> hardware from 30 years ago, and at least emulate obsolete PC hardware from >> 20 years ago that supports 64-bit DMA? > > Heh, I don't think there's an argument per se, people most definitely shouldn't > be emulating old hardware if they care about performance. I already told Yan as > much. > >> Even non-virtualised, SWIOTLB is pretty horrible for I/O performance by its >> very nature - avoiding it if at all possible should always be preferred. > > Yeah. The main reason I didn't just sweep this under the rug is the confidential > VM use case, where SWIOTLB is used to bounce data from guest private memory into > shread buffers. There's also a good argument that anyone that cares about I/O > performance in confidential VMs should put in the effort to enlighten their device > drivers to use shared memory directly, but practically speaking that's easier said > than done. Indeed a bunch of work has gone into SWIOTLB recently trying to make it a bit more efficient for such cases where it can't be avoided, so it is definitely still interesting to learn about impacts at other levels like this. Maybe there's a bit of a get-out for confidential VMs though, since presumably there's not much point COW-ing encrypted private memory, so perhaps KVM might end up wanting to optimise that out and thus happen to end up less sensitive to unavoidable SWIOTLB behaviour anyway? Cheers, Robin.
On Wed, Sep 06, 2023, Robin Murphy wrote: > On 2023-09-06 15:44, Sean Christopherson wrote: > > On Wed, Sep 06, 2023, Robin Murphy wrote: > > > Even non-virtualised, SWIOTLB is pretty horrible for I/O performance by its > > > very nature - avoiding it if at all possible should always be preferred. > > > > Yeah. The main reason I didn't just sweep this under the rug is the confidential > > VM use case, where SWIOTLB is used to bounce data from guest private memory into > > shread buffers. There's also a good argument that anyone that cares about I/O > > performance in confidential VMs should put in the effort to enlighten their device > > drivers to use shared memory directly, but practically speaking that's easier said > > than done. > > Indeed a bunch of work has gone into SWIOTLB recently trying to make it a > bit more efficient for such cases where it can't be avoided, so it is > definitely still interesting to learn about impacts at other levels like > this. Maybe there's a bit of a get-out for confidential VMs though, since > presumably there's not much point COW-ing encrypted private memory, so > perhaps KVM might end up wanting to optimise that out and thus happen to end > up less sensitive to unavoidable SWIOTLB behaviour anyway? CoW should be a non-issue for confidential VMs, at least on x86. SEV and SEV-ES are effectively forced to pin memory as writable before it can be mapped into the guest. TDX and SNP and will have a different implementation, but similar behavior. Confidential VMs would benefit purely by either eliminating or reducing the cost of "initializing" memory, i.e. by eliminating the memcpy() or replacing it with a memset(). I most definitely don't care enough about confidential VM I/O performance to try and micro-optimize that behavior, their existence was simply what made me look more closely instead of just telling Yan to stop using IDE :-)
On Tue, Sep 5, 2023 at 10:18 PM Sean Christopherson <seanjc@google.com> wrote: > Ooh, actually, maybe we could do > > static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM); > > and then cross our fingers that that doesn't regress some other funky setups. It probably breaks gvisor-like setups that use MAP_PRIVATE mmap for memslots? It would instantly break CoW even if memory is never written. Paolo
On Wed, Sep 06, 2023 at 05:51:36PM -0700, Sean Christopherson wrote: > On Thu, Sep 07, 2023, Paolo Bonzini wrote: > > On Tue, Sep 5, 2023 at 10:18 PM Sean Christopherson <seanjc@google.com> wrote: > > > Ooh, actually, maybe we could do > > > > > > static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM); > > > > > > and then cross our fingers that that doesn't regress some other funky setups. > > > > It probably breaks gvisor-like setups that use MAP_PRIVATE mmap for > > memslots? It would instantly break CoW even if memory is never > > written. > > Doh, I completely forgot about gvisor and the like. > > Yan, I don't think this is worth pursuing. My understanding is that only legacy, > relatively slow devices need DMA32. And as Robin pointed out, swiotlb=force isn't > something that's likely deployed and certainly isn't intended for performance > sensitive environments. Yes. Then will you reconsider my patch 2? I think it can remove the second EPT violation for each page write and doesn't break KSM, and COW.
On Thu, Sep 07, 2023, Paolo Bonzini wrote: > On Tue, Sep 5, 2023 at 10:18 PM Sean Christopherson <seanjc@google.com> wrote: > > Ooh, actually, maybe we could do > > > > static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM); > > > > and then cross our fingers that that doesn't regress some other funky setups. > > It probably breaks gvisor-like setups that use MAP_PRIVATE mmap for > memslots? It would instantly break CoW even if memory is never > written. Doh, I completely forgot about gvisor and the like. Yan, I don't think this is worth pursuing. My understanding is that only legacy, relatively slow devices need DMA32. And as Robin pointed out, swiotlb=force isn't something that's likely deployed and certainly isn't intended for performance sensitive environments.
On Wed, Sep 06, 2023 at 05:18:51PM +0100, Robin Murphy wrote: > Indeed a bunch of work has gone into SWIOTLB recently trying to make it a > bit more efficient for such cases where it can't be avoided, so it is > definitely still interesting to learn about impacts at other levels like > this. Maybe there's a bit of a get-out for confidential VMs though, since > presumably there's not much point COW-ing encrypted private memory, so > perhaps KVM might end up wanting to optimise that out and thus happen to > end up less sensitive to unavoidable SWIOTLB behaviour anyway? Well, the fix for bounce buffering is to trust the device, and there is a lot of work going into device authentication and attesttion right now so that will happen. On the swiotlb side a new version of the dma_sync_*_device APIs that specifies the mapping len and the data length transfer would avoid some of the overhead here. We've decided that it is a good idea last time, but so far no one has volunteers to implement it.