Message ID | 20191003212400.31130-1-rick.p.edgecombe@intel.com (mailing list archive) |
---|---|
Series | XOM for KVM guest userspace |
On 03/10/19 23:23, Rick Edgecombe wrote:
> Since software would have previously received a #PF with the RSVD error code
> set, when the HW encountered any set bits in the region 51 to M, there was some
> internal discussion on whether this should have a virtual MSR for the OS to turn
> it on only if the OS knows it isn't relying on this behavior for bit M. The
> argument against needing an MSR is this blurb from the Intel SDM about reserved
> bits:
> "Bits reserved in the paging-structure entries are reserved for future
> functionality. Software developers should be aware that such bits may be used in
> the future and that a paging-structure entry that causes a page-fault exception
> on one processor might not do so in the future."
>
> So in the current patchset there is no MSR write required for the guest to turn
> on this feature. It will have this behavior whenever qemu is run with
> "-cpu +xo".

I think the part of the manual that you quote is out of date. Whenever
Intel has "unreserved" bits in the page tables they have done that only
if specific bits in CR4 or EFER or VMCS execution controls are set; this
is a good thing, and I'd really like it to be codified in the SDM.

The only bits for which this does not (and should not) apply are indeed
bits 51:MAXPHYADDR. But the SDM makes it clear that bits 51:MAXPHYADDR
are reserved, hence "unreserving" bits based on just a QEMU command line
option would be against the specification. So, please don't do this and
introduce an MSR that enables the feature.

Paolo
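For readers following the reserved-bit argument, here is a minimal sketch of which page-table-entry bits fall in the 51:MAXPHYADDR range that Paolo refers to. The function names are hypothetical and exist only for illustration; the sketch models the mask arithmetic, not how any particular hypervisor or CPU implements the check.

```python
def reserved_pa_bits_mask(maxphyaddr: int) -> int:
    """Return a mask of PTE bits 51:MAXPHYADDR (hypothetical helper name)."""
    if not 0 < maxphyaddr <= 52:
        raise ValueError("MAXPHYADDR must be in the range 1..52")
    all_pa_bits = (1 << 52) - 1          # bits 51:0
    usable_bits = (1 << maxphyaddr) - 1  # bits MAXPHYADDR-1:0
    # What remains is bits 51:MAXPHYADDR; the mask is empty when MAXPHYADDR == 52.
    return all_pa_bits & ~usable_bits


def pte_sets_reserved_pa_bits(pte: int, maxphyaddr: int) -> bool:
    """True if the entry sets any bit in 51:MAXPHYADDR."""
    return bool(pte & reserved_pa_bits_mask(maxphyaddr))
```

With MAXPHYADDR = 46, for example, the mask covers bits 51:46, so an entry with bit 50 set would be flagged while one with only bit 45 set would not.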
On Thu, Oct 3, 2019 at 2:38 PM Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
>
> This patchset enables the ability for KVM guests to create execute-only (XO)
> memory by utilizing EPT based XO permissions. XO memory is currently supported
> on Intel hardware natively for CPU's with PKU, but this enables it on older
> platforms, and can support XO for kernel memory as well.

The patchset seems to sometimes call this feature "XO" and sometimes
call it "NR". To me, XO implies no-read and no-write, whereas NR
implies just no-read. Can you please clarify *exactly* what the new
bit does and be consistent?

I suggest that you make it NR, which allows for PROT_EXEC and
PROT_EXEC|PROT_WRITE and plain PROT_WRITE. WX is of dubious value,
but I can imagine plain W being genuinely useful for logging and for
JITs that could maintain a W and a separate X mapping of some code.
In other words, with an NR bit, all 8 logical access modes are
possible. Also, keeping the paging bits more orthogonal seems nice --
we already have a bit that controls write access.
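The "all 8 logical access modes" remark can be illustrated with a small enumeration. This is only a sketch of the orthogonal interpretation Andy proposes, in which NR, W and NX are treated as three independent bits; `access_mode` and `modes` are hypothetical names, and the model says nothing about what the hardware can actually enforce.

```python
from itertools import product


def access_mode(nr: bool, w: bool, nx: bool) -> str:
    """Render a mode string for independent NR, W and NX bits (hypothetical model)."""
    read = "-" if nr else "r"    # NR set removes read access
    write = "w" if w else "-"    # W set grants write access
    execute = "-" if nx else "x" # NX set removes execute access
    return read + write + execute


# Every combination of the three bits yields a distinct mode string.
modes = sorted({access_mode(nr, w, nx)
                for nr, w, nx in product((False, True), repeat=3)})
```

In this model, plain write-only memory is the `-w-` entry (NR=1, W=1, NX=1) and execute-only is `--x` (NR=1, W=0, NX=0).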
On Fri, 2019-10-04 at 09:22 +0200, Paolo Bonzini wrote:
> On 03/10/19 23:23, Rick Edgecombe wrote:
> > Since software would have previously received a #PF with the RSVD error
> > code set, when the HW encountered any set bits in the region 51 to M,
> > there was some internal discussion on whether this should have a virtual
> > MSR for the OS to turn it on only if the OS knows it isn't relying on
> > this behavior for bit M. The argument against needing an MSR is this
> > blurb from the Intel SDM about reserved bits:
> > "Bits reserved in the paging-structure entries are reserved for future
> > functionality. Software developers should be aware that such bits may be
> > used in the future and that a paging-structure entry that causes a
> > page-fault exception on one processor might not do so in the future."
> >
> > So in the current patchset there is no MSR write required for the guest
> > to turn on this feature. It will have this behavior whenever qemu is run
> > with "-cpu +xo".
>
> I think the part of the manual that you quote is out of date. Whenever
> Intel has "unreserved" bits in the page tables they have done that only
> if specific bits in CR4 or EFER or VMCS execution controls are set; this
> is a good thing, and I'd really like it to be codified in the SDM.
>
> The only bits for which this does not (and should not) apply are indeed
> bits 51:MAXPHYADDR. But the SDM makes it clear that bits 51:MAXPHYADDR
> are reserved, hence "unreserving" bits based on just a QEMU command line
> option would be against the specification. So, please don't do this and
> introduce an MSR that enables the feature.
>
> Paolo
>

Hi Paolo,

Thanks for taking a look!

Fair enough, MSR it is.

Rick
On Fri, 2019-10-04 at 07:56 -0700, Andy Lutomirski wrote:
> On Thu, Oct 3, 2019 at 2:38 PM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> >
> > This patchset enables the ability for KVM guests to create execute-only
> > (XO) memory by utilizing EPT based XO permissions. XO memory is currently
> > supported on Intel hardware natively for CPU's with PKU, but this enables
> > it on older platforms, and can support XO for kernel memory as well.
>
> The patchset seems to sometimes call this feature "XO" and sometimes
> call it "NR". To me, XO implies no-read and no-write, whereas NR
> implies just no-read. Can you please clarify *exactly* what the new
> bit does and be consistent?
>
> I suggest that you make it NR, which allows for PROT_EXEC and
> PROT_EXEC|PROT_WRITE and plain PROT_WRITE. WX is of dubious value,
> but I can imagine plain W being genuinely useful for logging and for
> JITs that could maintain a W and a separate X mapping of some code.
> In other words, with an NR bit, all 8 logical access modes are
> possible. Also, keeping the paging bits more orthogonal seems nice --
> we already have a bit that controls write access.

Sorry, yes the behavior of this bit needs to be documented a lot better. I will
definitely do this for the next version.

To clarify, since the EPT permissions in the XO/NR range are executable, and not
readable or writeable the new bit really means XO, but only when NX is 0 since
the guest page tables are being checked as well. When NR=1, W=1, and NX=0, the
memory is still XO.

NR was picked over XO because as you say. The idea is that it can be defined
that in the case of KVM XO, NR and writable is not a valid combination, like
writeable but not readable is defined as not valid for the EPT.

I *think* whenever NX=1, NR=1 it should be similar to not present in that it
can't be used for anything or have its translation cached. I am not 100% sure on
the cached part and was thinking of just making the "spec" that the translation
caching behavior is undefined. I can look into this if anyone thinks we need to
know. In the current patchset it shouldn't be possible to create this
combination.

Since write-only memory isn't supported in EPT we can't do the same trick to
create a new HW permission. But I guess if we emulate it, we could make the new
bit mean just NR, and support write-only by allowing emulation when KVM gets a
write EPT violation to NR memory. It might still be useful for the JIT case you
mentioned, or a shared memory mailbox. On the other hand, userspace might be
surprised to find that memory runs at different speeds depending on the
permission. I also wonder if any userspace apps are asking for just PROT_WRITE
and expecting readable memory.

Thanks,

Rick
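Rick's description of the bit can be read as a small truth table. The sketch below is one interpretation of the message above, not code from the patchset: `effective_access` is a hypothetical name, the page is assumed to be present, user/supervisor checks are ignored, and the NR=1, NX=1 case is modeled as "no access" only because Rick says he *thinks* it behaves like a not-present entry.

```python
def effective_access(nr: bool, w: bool, nx: bool) -> str:
    """Effective guest access under the semantics described in this message.

    Assumptions (not confirmed by the patchset): the page is present and the
    NR=1, NX=1 combination permits no access at all.
    """
    if not nr:
        # Ordinary mapping: readable, W and NX behave as usual.
        return "r" + ("w" if w else "-") + ("-" if nx else "x")
    if not nx:
        # NR=1, NX=0: the EPT permission is execute-only, so W is irrelevant.
        return "--x"
    # NR=1, NX=1: nothing is permitted by either the EPT or the guest tables.
    return "---"
```

Note how `effective_access(True, True, False)` and `effective_access(True, False, False)` both give `--x`, which is the "NR=1, W=1, and NX=0 is still XO" case.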
On Fri, Oct 4, 2019 at 1:10 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2019-10-04 at 07:56 -0700, Andy Lutomirski wrote:
> > On Thu, Oct 3, 2019 at 2:38 PM Rick Edgecombe
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > This patchset enables the ability for KVM guests to create execute-only
> > > (XO) memory by utilizing EPT based XO permissions. XO memory is
> > > currently supported on Intel hardware natively for CPU's with PKU, but
> > > this enables it on older platforms, and can support XO for kernel
> > > memory as well.
> >
> > The patchset seems to sometimes call this feature "XO" and sometimes
> > call it "NR". To me, XO implies no-read and no-write, whereas NR
> > implies just no-read. Can you please clarify *exactly* what the new
> > bit does and be consistent?
> >
> > I suggest that you make it NR, which allows for PROT_EXEC and
> > PROT_EXEC|PROT_WRITE and plain PROT_WRITE. WX is of dubious value,
> > but I can imagine plain W being genuinely useful for logging and for
> > JITs that could maintain a W and a separate X mapping of some code.
> > In other words, with an NR bit, all 8 logical access modes are
> > possible. Also, keeping the paging bits more orthogonal seems nice --
> > we already have a bit that controls write access.
>
> Sorry, yes the behavior of this bit needs to be documented a lot better. I will
> definitely do this for the next version.
>
> To clarify, since the EPT permissions in the XO/NR range are executable, and not
> readable or writeable the new bit really means XO, but only when NX is 0 since
> the guest page tables are being checked as well. When NR=1, W=1, and NX=0, the
> memory is still XO.
>
> NR was picked over XO because as you say. The idea is that it can be defined
> that in the case of KVM XO, NR and writable is not a valid combination, like
> writeable but not readable is defined as not valid for the EPT.
>

Ugh, I see, this is an "EPT Misconfiguration". Oh, well. I guess
just keep things as they are and document things better, please.
Don't try to emulate.

I don't suppose Intel could be convinced to get rid of that in a
future CPU and allow write-only memory?

BTW, is your patch checking for support in IA32_VMX_EPT_VPID_CAP? I
didn't notice it, but I didn't look that hard.
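As background for the "EPT Misconfiguration" remark, here is a rough sketch of the permission-bit rules involved, based on the Intel SDM's description of EPT entries (bit 0 read, bit 1 write, bit 2 execute). The function name is hypothetical, and the sketch covers only the permission-related causes; a real check also has to consider reserved bits and memory-type fields.

```python
EPT_READ = 1 << 0   # bit 0 of an EPT entry: read access
EPT_WRITE = 1 << 1  # bit 1: write access
EPT_EXEC = 1 << 2   # bit 2: execute access


def ept_perms_misconfigured(entry: int, execute_only_supported: bool) -> bool:
    """Permission-bit subset of the EPT misconfiguration rules (sketch only)."""
    readable = bool(entry & EPT_READ)
    writable = bool(entry & EPT_WRITE)
    executable = bool(entry & EPT_EXEC)
    if writable and not readable:
        # Write-only and write/execute-only entries are never valid.
        return True
    if executable and not readable and not execute_only_supported:
        # Execute-only is valid only when the CPU advertises support for it.
        return True
    return False
```

This is why write-only memory cannot be built the same way as execute-only memory: `ept_perms_misconfigured(EPT_WRITE, True)` is true regardless of the capability bit, while `ept_perms_misconfigured(EPT_EXEC, True)` is false.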
On Fri, 2019-10-04 at 18:33 -0700, Andy Lutomirski wrote:
> On Fri, Oct 4, 2019 at 1:10 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Fri, 2019-10-04 at 07:56 -0700, Andy Lutomirski wrote:
> > > On Thu, Oct 3, 2019 at 2:38 PM Rick Edgecombe
> > > <rick.p.edgecombe@intel.com> wrote:
> > > >
> > > > This patchset enables the ability for KVM guests to create
> > > > execute-only (XO) memory by utilizing EPT based XO permissions. XO
> > > > memory is currently supported on Intel hardware natively for CPU's
> > > > with PKU, but this enables it on older platforms, and can support XO
> > > > for kernel memory as well.
> > >
> > > The patchset seems to sometimes call this feature "XO" and sometimes
> > > call it "NR". To me, XO implies no-read and no-write, whereas NR
> > > implies just no-read. Can you please clarify *exactly* what the new
> > > bit does and be consistent?
> > >
> > > I suggest that you make it NR, which allows for PROT_EXEC and
> > > PROT_EXEC|PROT_WRITE and plain PROT_WRITE. WX is of dubious value,
> > > but I can imagine plain W being genuinely useful for logging and for
> > > JITs that could maintain a W and a separate X mapping of some code.
> > > In other words, with an NR bit, all 8 logical access modes are
> > > possible. Also, keeping the paging bits more orthogonal seems nice --
> > > we already have a bit that controls write access.
> >
> > Sorry, yes the behavior of this bit needs to be documented a lot better.
> > I will definitely do this for the next version.
> >
> > To clarify, since the EPT permissions in the XO/NR range are executable,
> > and not readable or writeable the new bit really means XO, but only when
> > NX is 0 since the guest page tables are being checked as well. When NR=1,
> > W=1, and NX=0, the memory is still XO.
> >
> > NR was picked over XO because as you say. The idea is that it can be
> > defined that in the case of KVM XO, NR and writable is not a valid
> > combination, like writeable but not readable is defined as not valid for
> > the EPT.
> >
>
> Ugh, I see, this is an "EPT Misconfiguration". Oh, well. I guess
> just keep things as they are and document things better, please.
> Don't try to emulate.

Ah, I see what you were thinking. Ok will do.

> I don't suppose Intel could be convinced to get rid of that in a
> future CPU and allow write-only memory?

Hmm, I'm not sure. I can try to pass it along.

> BTW, is your patch checking for support in IA32_VMX_EPT_VPID_CAP? I
> didn't notice it, but I didn't look that hard.

Yep, there was already a helper: cpu_has_vmx_ept_execute_only().
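For context on what such a capability check amounts to, the sketch below tests the execute-only bit of a raw IA32_VMX_EPT_VPID_CAP value. It is not the kernel helper itself: the Python function name is hypothetical, the MSR value is passed in as a plain integer rather than read from hardware, and the constants reflect the SDM's description (MSR index 0x48C, bit 0 indicating support for execute-only EPT translations).

```python
MSR_IA32_VMX_EPT_VPID_CAP = 0x48C  # MSR index, listed here for reference only
VMX_EPT_EXECUTE_ONLY_BIT = 1 << 0  # bit 0: execute-only translations supported


def ept_supports_execute_only(ept_vpid_cap: int) -> bool:
    """Check the execute-only capability bit in a raw MSR value (sketch)."""
    return bool(ept_vpid_cap & VMX_EPT_EXECUTE_ONLY_BIT)
```

A caller would obtain `ept_vpid_cap` by whatever MSR-reading mechanism is available to it; that part is deliberately left out here.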
On Thu, Oct 03, 2019 at 02:23:47PM -0700, Rick Edgecombe wrote:
> larger follow on to this enables setting the kernel text as XO, but this is just
Is the kernel side series visible somewhere public yet?
On Tue, 2019-10-29 at 16:40 -0700, Kees Cook wrote:
> On Thu, Oct 03, 2019 at 02:23:47PM -0700, Rick Edgecombe wrote:
> > larger follow on to this enables setting the kernel text as XO, but this
> > is just
>
> Is the kernel side series visible somewhere public yet?
>

The POC from my Plumber's talk is up here:
https://github.com/redgecombe/linux/commits/exec_only

It doesn't work with this KVM series though as I made changes on the KVM side. I
don't consider it ready for posting on the list yet.

Luckily though, PeterZ's switching of ftrace to text_poke(), and your exception
table patchset will make it easier when the time comes.

Right now I am re-doing the KVM pieces to get rid of the memslot duplication. I
am ending up having to touch a lot more KVM mmu code, and it's taken some time
to work through. Then I wanted to get some more performance numbers before
dropping the RFC tag. So it may still be a bit before I can pick up the kernel
text piece again.