Message ID | 20230119212317.8324-19-rick.p.edgecombe@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Shadow stacks for userspace | expand |
On 19.01.23 22:22, Rick Edgecombe wrote: > The x86 Control-flow Enforcement Technology (CET) feature includes a new > type of memory called shadow stack. This shadow stack memory has some > unusual properties, which requires some core mm changes to function > properly. > > Since shadow stack memory can be changed from userspace, is both > VM_SHADOW_STACK and VM_WRITE. But it should not be made conventionally > writable (i.e. pte_mkwrite()). So some code that calls pte_mkwrite() needs > to be adjusted. > > One such case is when memory is made writable without an actual write > fault. This happens in some mprotect operations, and also prot_numa faults. > In both cases code checks whether it should be made (conventionally) > writable by calling vma_wants_manual_pte_write_upgrade(). > > One way to fix this would be have code actually check if memory is also > VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But since > most memory won't be shadow stack, just have simpler logic and skip this > optimization by changing vma_wants_manual_pte_write_upgrade() to not > return true for VM_SHADOW_STACK_MEMORY. This will simply handle all > cases of this type. > > Cc: David Hildenbrand <david@redhat.com> > Tested-by: Pengfei Xu <pengfei.xu@intel.com> > Tested-by: John Allen <john.allen@amd.com> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> > --- Instead of having these x86-shadow stack details all over the MM space, was the option explored to handle this more in arch specific code? IIUC, one way to get it working would be 1) Have a SW "shadowstack" PTE flag. 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when "write=0". pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions based on the "shadowstack" PTE flag and hide all these details from core-mm. When mapping a shadowstack page (new page, migration, swapin, ...), which can be obtained by looking at the VMA flags, the first thing you'd do is set the "shadowstack" PTE flag.
On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote: > On 19.01.23 22:22, Rick Edgecombe wrote: > > The x86 Control-flow Enforcement Technology (CET) feature includes > > a new > > type of memory called shadow stack. This shadow stack memory has > > some > > unusual properties, which requires some core mm changes to function > > properly. > > > > Since shadow stack memory can be changed from userspace, is both > > VM_SHADOW_STACK and VM_WRITE. But it should not be made > > conventionally > > writable (i.e. pte_mkwrite()). So some code that calls > > pte_mkwrite() needs > > to be adjusted. > > > > One such case is when memory is made writable without an actual > > write > > fault. This happens in some mprotect operations, and also prot_numa > > faults. > > In both cases code checks whether it should be made > > (conventionally) > > writable by calling vma_wants_manual_pte_write_upgrade(). > > > > One way to fix this would be have code actually check if memory is > > also > > VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But > > since > > most memory won't be shadow stack, just have simpler logic and skip > > this > > optimization by changing vma_wants_manual_pte_write_upgrade() to > > not > > return true for VM_SHADOW_STACK_MEMORY. This will simply handle all > > cases of this type. > > > > Cc: David Hildenbrand <david@redhat.com> > > Tested-by: Pengfei Xu <pengfei.xu@intel.com> > > Tested-by: John Allen <john.allen@amd.com> > > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> > > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> > > --- > > Instead of having these x86-shadow stack details all over the MM > space, > was the option explored to handle this more in arch specific code? > > IIUC, one way to get it working would be > > 1) Have a SW "shadowstack" PTE flag. > 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when "write=0". I don't think that idea came up. So vma->vm_page_prot would have the SW shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do Write=0,Dirty=1 part. It seems like it should work. > > pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions > based > on the "shadowstack" PTE flag and hide all these details from core- > mm. > > When mapping a shadowstack page (new page, migration, swapin, ...), > which can be obtained by looking at the VMA flags, the first thing > you'd > do is set the "shadowstack" PTE flag. I guess the downside is that it uses an extra software bit. But the other positive is that it's less error prone, so that someone writing core-mm code won't introduce a change that makes shadow stack VMAs Write=1 if they don't know to also check for VM_SHADOW_STACK.
On 23.01.23 21:47, Edgecombe, Rick P wrote: > On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote: >> On 19.01.23 22:22, Rick Edgecombe wrote: >>> The x86 Control-flow Enforcement Technology (CET) feature includes >>> a new >>> type of memory called shadow stack. This shadow stack memory has >>> some >>> unusual properties, which requires some core mm changes to function >>> properly. >>> >>> Since shadow stack memory can be changed from userspace, is both >>> VM_SHADOW_STACK and VM_WRITE. But it should not be made >>> conventionally >>> writable (i.e. pte_mkwrite()). So some code that calls >>> pte_mkwrite() needs >>> to be adjusted. >>> >>> One such case is when memory is made writable without an actual >>> write >>> fault. This happens in some mprotect operations, and also prot_numa >>> faults. >>> In both cases code checks whether it should be made >>> (conventionally) >>> writable by calling vma_wants_manual_pte_write_upgrade(). >>> >>> One way to fix this would be have code actually check if memory is >>> also >>> VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But >>> since >>> most memory won't be shadow stack, just have simpler logic and skip >>> this >>> optimization by changing vma_wants_manual_pte_write_upgrade() to >>> not >>> return true for VM_SHADOW_STACK_MEMORY. This will simply handle all >>> cases of this type. >>> >>> Cc: David Hildenbrand <david@redhat.com> >>> Tested-by: Pengfei Xu <pengfei.xu@intel.com> >>> Tested-by: John Allen <john.allen@amd.com> >>> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> >>> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> >>> --- >> >> Instead of having these x86-shadow stack details all over the MM >> space, >> was the option explored to handle this more in arch specific code? >> >> IIUC, one way to get it working would be >> >> 1) Have a SW "shadowstack" PTE flag. >> 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when "write=0". > > I don't think that idea came up. So vma->vm_page_prot would have the SW > shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do > Write=0,Dirty=1 part. It seems like it should work. > Right, if we include it in vma->vm_page_prot, we'd immediately let mk_pte() just handle that. Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma instead of the vma->vm_page_prot. Let's see if we can avoid that for now. >> >> pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions >> based >> on the "shadowstack" PTE flag and hide all these details from core- >> mm. >> >> When mapping a shadowstack page (new page, migration, swapin, ...), >> which can be obtained by looking at the VMA flags, the first thing >> you'd >> do is set the "shadowstack" PTE flag. > > I guess the downside is that it uses an extra software bit. But the > other positive is that it's less error prone, so that someone writing > core-mm code won't introduce a change that makes shadow stack VMAs > Write=1 if they don't know to also check for VM_SHADOW_STACK. Right. And I think this mimics the what I would have expected HW to provide: a dedicated HW bit, not somehow mangling this into semantics of existing bits. Roughly speaking: if we abstract it that way and get all of the "how to set it writable now?" out of core-MM, it not only is cleaner and less error prone, it might even allow other architectures that implement something comparable (e.g., using a dedicated HW bit) to actually reuse some of that work. Otherwise most of that "shstk" is really just x86 specific ... I guess the only cases we have to special case would be page pinning code where pte_write() would indicate that the PTE is writable (well, it is, just not by "ordinary CPU instruction" context directly): but you do that already, so ... :) Sorry for stumbling over that this late, I only started looking into this when you CCed me on that one patch.
On Tue, 2023-01-24 at 17:24 +0100, David Hildenbrand wrote: > On 23.01.23 21:47, Edgecombe, Rick P wrote: > > On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote: > > > On 19.01.23 22:22, Rick Edgecombe wrote: > > > > The x86 Control-flow Enforcement Technology (CET) feature > > > > includes > > > > a new > > > > type of memory called shadow stack. This shadow stack memory > > > > has > > > > some > > > > unusual properties, which requires some core mm changes to > > > > function > > > > properly. > > > > > > > > Since shadow stack memory can be changed from userspace, is > > > > both > > > > VM_SHADOW_STACK and VM_WRITE. But it should not be made > > > > conventionally > > > > writable (i.e. pte_mkwrite()). So some code that calls > > > > pte_mkwrite() needs > > > > to be adjusted. > > > > > > > > One such case is when memory is made writable without an actual > > > > write > > > > fault. This happens in some mprotect operations, and also > > > > prot_numa > > > > faults. > > > > In both cases code checks whether it should be made > > > > (conventionally) > > > > writable by calling vma_wants_manual_pte_write_upgrade(). > > > > > > > > One way to fix this would be have code actually check if memory > > > > is > > > > also > > > > VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But > > > > since > > > > most memory won't be shadow stack, just have simpler logic and > > > > skip > > > > this > > > > optimization by changing vma_wants_manual_pte_write_upgrade() > > > > to > > > > not > > > > return true for VM_SHADOW_STACK_MEMORY. This will simply handle > > > > all > > > > cases of this type. > > > > > > > > Cc: David Hildenbrand <david@redhat.com> > > > > Tested-by: Pengfei Xu <pengfei.xu@intel.com> > > > > Tested-by: John Allen <john.allen@amd.com> > > > > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> > > > > Reviewed-by: Kirill A. Shutemov < > > > > kirill.shutemov@linux.intel.com> > > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> > > > > --- > > > > > > Instead of having these x86-shadow stack details all over the MM > > > space, > > > was the option explored to handle this more in arch specific > > > code? > > > > > > IIUC, one way to get it working would be > > > > > > 1) Have a SW "shadowstack" PTE flag. > > > 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when > > > "write=0". > > > > I don't think that idea came up. So vma->vm_page_prot would have > > the SW > > shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do > > Write=0,Dirty=1 part. It seems like it should work. > > > > Right, if we include it in vma->vm_page_prot, we'd immediately let > mk_pte() just handle that. > > Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma > instead > of the vma->vm_page_prot. Let's see if we can avoid that for now. > > > > > > > pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions > > > based > > > on the "shadowstack" PTE flag and hide all these details from > > > core- > > > mm. > > > > > > When mapping a shadowstack page (new page, migration, swapin, > > > ...), > > > which can be obtained by looking at the VMA flags, the first > > > thing > > > you'd > > > do is set the "shadowstack" PTE flag. > > > > I guess the downside is that it uses an extra software bit. But the > > other positive is that it's less error prone, so that someone > > writing > > core-mm code won't introduce a change that makes shadow stack VMAs > > Write=1 if they don't know to also check for VM_SHADOW_STACK. > > Right. And I think this mimics the what I would have expected HW to > provide: a dedicated HW bit, not somehow mangling this into semantics > of > existing bits. Yea. > > Roughly speaking: if we abstract it that way and get all of the "how > to > set it writable now?" out of core-MM, it not only is cleaner and > less > error prone, it might even allow other architectures that implement > something comparable (e.g., using a dedicated HW bit) to actually > reuse > some of that work. Otherwise most of that "shstk" is really just x86 > specific ... > > I guess the only cases we have to special case would be page pinning > code where pte_write() would indicate that the PTE is writable (well, > it > is, just not by "ordinary CPU instruction" context directly): but you > do > that already, so ... :) > > Sorry for stumbling over that this late, I only started looking into > this when you CCed me on that one patch. Sorry for not calling more attention to it earlier. Appreciate your comments. Previously versions of this series had changed some of these pte_mkwrite() calls to maybe_mkwrite(), which of course takes a vma. This way an x86 implementation could use the VM_SHADOW_STACK vma flag to decide between pte_mkwrite() and pte_mkwrite_shstk(). The feedback was that in some of these code paths "maybe" isn't really an option, it *needs* to make it writable. Even though the logic was the same, the name of the function made it look wrong. But another option could be to change pte_mkwrite() to take a vma. This would save using another software bit on x86, but instead requires a small change to each arch's pte_mkwrite(). x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(), but maybe it could additionally warn if the vma is not writable. It also seems more aligned with your changes to stop taking hints from PTE bits and just look at the VMA? (I'm thinking about the dropping of the dirty check in GUP and dropping pte_saved_write())
On 24.01.23 19:14, Edgecombe, Rick P wrote: > On Tue, 2023-01-24 at 17:24 +0100, David Hildenbrand wrote: >> On 23.01.23 21:47, Edgecombe, Rick P wrote: >>> On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote: >>>> On 19.01.23 22:22, Rick Edgecombe wrote: >>>>> The x86 Control-flow Enforcement Technology (CET) feature >>>>> includes >>>>> a new >>>>> type of memory called shadow stack. This shadow stack memory >>>>> has >>>>> some >>>>> unusual properties, which requires some core mm changes to >>>>> function >>>>> properly. >>>>> >>>>> Since shadow stack memory can be changed from userspace, is >>>>> both >>>>> VM_SHADOW_STACK and VM_WRITE. But it should not be made >>>>> conventionally >>>>> writable (i.e. pte_mkwrite()). So some code that calls >>>>> pte_mkwrite() needs >>>>> to be adjusted. >>>>> >>>>> One such case is when memory is made writable without an actual >>>>> write >>>>> fault. This happens in some mprotect operations, and also >>>>> prot_numa >>>>> faults. >>>>> In both cases code checks whether it should be made >>>>> (conventionally) >>>>> writable by calling vma_wants_manual_pte_write_upgrade(). >>>>> >>>>> One way to fix this would be have code actually check if memory >>>>> is >>>>> also >>>>> VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But >>>>> since >>>>> most memory won't be shadow stack, just have simpler logic and >>>>> skip >>>>> this >>>>> optimization by changing vma_wants_manual_pte_write_upgrade() >>>>> to >>>>> not >>>>> return true for VM_SHADOW_STACK_MEMORY. This will simply handle >>>>> all >>>>> cases of this type. >>>>> >>>>> Cc: David Hildenbrand <david@redhat.com> >>>>> Tested-by: Pengfei Xu <pengfei.xu@intel.com> >>>>> Tested-by: John Allen <john.allen@amd.com> >>>>> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> >>>>> Reviewed-by: Kirill A. Shutemov < >>>>> kirill.shutemov@linux.intel.com> >>>>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> >>>>> --- >>>> >>>> Instead of having these x86-shadow stack details all over the MM >>>> space, >>>> was the option explored to handle this more in arch specific >>>> code? >>>> >>>> IIUC, one way to get it working would be >>>> >>>> 1) Have a SW "shadowstack" PTE flag. >>>> 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when >>>> "write=0". >>> >>> I don't think that idea came up. So vma->vm_page_prot would have >>> the SW >>> shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do >>> Write=0,Dirty=1 part. It seems like it should work. >>> >> >> Right, if we include it in vma->vm_page_prot, we'd immediately let >> mk_pte() just handle that. >> >> Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma >> instead >> of the vma->vm_page_prot. Let's see if we can avoid that for now. >> >>>> >>>> pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions >>>> based >>>> on the "shadowstack" PTE flag and hide all these details from >>>> core- >>>> mm. >>>> >>>> When mapping a shadowstack page (new page, migration, swapin, >>>> ...), >>>> which can be obtained by looking at the VMA flags, the first >>>> thing >>>> you'd >>>> do is set the "shadowstack" PTE flag. >>> >>> I guess the downside is that it uses an extra software bit. But the >>> other positive is that it's less error prone, so that someone >>> writing >>> core-mm code won't introduce a change that makes shadow stack VMAs >>> Write=1 if they don't know to also check for VM_SHADOW_STACK. >> >> Right. And I think this mimics the what I would have expected HW to >> provide: a dedicated HW bit, not somehow mangling this into semantics >> of >> existing bits. > > Yea. > >> >> Roughly speaking: if we abstract it that way and get all of the "how >> to >> set it writable now?" out of core-MM, it not only is cleaner and >> less >> error prone, it might even allow other architectures that implement >> something comparable (e.g., using a dedicated HW bit) to actually >> reuse >> some of that work. Otherwise most of that "shstk" is really just x86 >> specific ... >> >> I guess the only cases we have to special case would be page pinning >> code where pte_write() would indicate that the PTE is writable (well, >> it >> is, just not by "ordinary CPU instruction" context directly): but you >> do >> that already, so ... :) >> >> Sorry for stumbling over that this late, I only started looking into >> this when you CCed me on that one patch. > > Sorry for not calling more attention to it earlier. Appreciate your > comments. > > Previously versions of this series had changed some of these > pte_mkwrite() calls to maybe_mkwrite(), which of course takes a vma. > This way an x86 implementation could use the VM_SHADOW_STACK vma flag > to decide between pte_mkwrite() and pte_mkwrite_shstk(). The feedback > was that in some of these code paths "maybe" isn't really an option, it > *needs* to make it writable. Even though the logic was the same, the > name of the function made it look wrong. > > But another option could be to change pte_mkwrite() to take a vma. This > would save using another software bit on x86, but instead requires a > small change to each arch's pte_mkwrite(). I played with that idea shortly as well, but discarded it. I was not able to convince myself that it wouldn't be required to pass in the VMA as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ... which would end up fairly ugly (or even impossible in thing slike GUP-fast). For example, I wonder how we'd be handling stuff like do_numa_page() cleanly correctly, where we use pte_modify() + pte_mkwrite(), and either call might set the PTE writable and maintain dirty bit ... Having that said, maybe it could work with only a single saved-dirty bit and passing in the VMA for pte_mkwrite() only. pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty bit to the soft-dirty bit instead, resulting in "writable=0,dirty=0,saved-dirty=1", pte_dirty() would return dirty==1||saved-dirty==1. pte_mkdirty() would set either set dirty=1 or saved-dirty=1, depending on the writable bit. pte_mkclean() would clean both bits. pte_write() would detect "writable == 1 || (writable==0 && dirty==1)" pte_mkwrite() would act according to the VMA, and in addition, merge the saved-dirty bit into the dirty bit. pte_modify() and mk_pte() .... would require more thought ... Further, ptep_modify_prot_commit() might have to be adjusted to properly flush in all relevant cases IIRC. > > x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(), but > maybe it could additionally warn if the vma is not writable. It also > seems more aligned with your changes to stop taking hints from PTE bits > and just look at the VMA? (I'm thinking about the dropping of the dirty > check in GUP and dropping pte_saved_write()) The soft-shstk bit wouldn't be a hint, it would be logically changing the "type" of the PTE such that any other PTE functions can do the right thing without having to consume the VMA.
On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote: > > > Roughly speaking: if we abstract it that way and get all of the > > > "how > > > to > > > set it writable now?" out of core-MM, it not only is cleaner and > > > less > > > error prone, it might even allow other architectures that > > > implement > > > something comparable (e.g., using a dedicated HW bit) to actually > > > reuse > > > some of that work. Otherwise most of that "shstk" is really just > > > x86 > > > specific ... > > > > > > I guess the only cases we have to special case would be page > > > pinning > > > code where pte_write() would indicate that the PTE is writable > > > (well, > > > it > > > is, just not by "ordinary CPU instruction" context directly): but > > > you > > > do > > > that already, so ... :) > > > > > > Sorry for stumbling over that this late, I only started looking > > > into > > > this when you CCed me on that one patch. > > > > Sorry for not calling more attention to it earlier. Appreciate your > > comments. > > > > Previously versions of this series had changed some of these > > pte_mkwrite() calls to maybe_mkwrite(), which of course takes a > > vma. > > This way an x86 implementation could use the VM_SHADOW_STACK vma > > flag > > to decide between pte_mkwrite() and pte_mkwrite_shstk(). The > > feedback > > was that in some of these code paths "maybe" isn't really an > > option, it > > *needs* to make it writable. Even though the logic was the same, > > the > > name of the function made it look wrong. > > > > But another option could be to change pte_mkwrite() to take a vma. > > This > > would save using another software bit on x86, but instead requires > > a > > small change to each arch's pte_mkwrite(). > > I played with that idea shortly as well, but discarded it. I was not > able to convince myself that it wouldn't be required to pass in the > VMA > as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ... > which would end up fairly ugly (or even impossible in thing slike > GUP-fast). > > For example, I wonder how we'd be handling stuff like do_numa_page() > cleanly correctly, where we use pte_modify() + pte_mkwrite(), and > either > call might set the PTE writable and maintain dirty bit ... pte_modify() is handled like this currently: https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/ There has been a couple iterations on that. The current solution is to do the Dirty->SavedDirty fixup if needed after the new prots are added. Of course pte_modify() can't know whether you are are attempting to create a shadow stack PTE with the prot you are passing in. But the callers today explicitly call pte_mkwrite() after filling in the other bits with pte_modify(). Today this patch causes the pte_mkwrite() to be skipped and another fault may be required in the mprotect() and numa cases, but if we change pte_mkwrite() to take a VMA we can just make it shadow stack to start. It might be worth mentioning, there was a suggestion in the past to try to have the shadow stack bits come out of vm_get_page_prot(), but MM code would then try to map the zero page as (shadow stack) writable when there was a normal (non-shadow stack) read access. So I had to abandon that approach and rely on explicit calls to pte_mkwrite/shstk() to make it shadow stack. > > Having that said, maybe it could work with only a single saved-dirty > bit > and passing in the VMA for pte_mkwrite() only. > > pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty > bit > to the soft-dirty bit instead, resulting in > "writable=0,dirty=0,saved-dirty=1", > > pte_dirty() would return dirty==1||saved-dirty==1. > > pte_mkdirty() would set either set dirty=1 or saved-dirty=1, > depending > on the writable bit. > > pte_mkclean() would clean both bits. > > pte_write() would detect "writable == 1 || (writable==0 && dirty==1)" > > pte_mkwrite() would act according to the VMA, and in addition, merge > the > saved-dirty bit into the dirty bit. > > pte_modify() and mk_pte() .... would require more thought ... Not sure I'm following what the mk_pte() problem would be. You mean if Write=0,Dirty=1 is manually added to the prot? Shouldn't people generally use the pte_mkwrite() helpers unless they are drawing from a prot that was already created with the helpers or vm_get_page_prot()? I think they can't manually create prot's from bits in core mm code, right? And x86 arch code already has to be aware of shadow stack. It's a bit of an assumption I guess, but I think maybe not too crazy of one? > > > Further, ptep_modify_prot_commit() might have to be adjusted to > properly > flush in all relevant cases IIRC. Sorry, I'm not following. Can you elaborate? There is an adjustment made in pte_flags_need_flush(). > > > > > x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(), > > but > > maybe it could additionally warn if the vma is not writable. It > > also > > seems more aligned with your changes to stop taking hints from PTE > > bits > > and just look at the VMA? (I'm thinking about the dropping of the > > dirty > > check in GUP and dropping pte_saved_write()) > > The soft-shstk bit wouldn't be a hint, it would be logically > changing > the "type" of the PTE such that any other PTE functions can do the > right > thing without having to consume the VMA. Yea, true. Thanks for your comments and ideas here, I'll give the: pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte) ...solution a try.
On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote: > Thanks for your comments and ideas here, I'll give the: > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte) > ...solution a try. Well, it turns out there are some pte_mkwrite() callers in other arch's that operate on kernel memory and don't have a VMA. So it needed a new function that can be overridden in arch code. I ended up with x86 versions of these, like this: pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) { if (!(vma->vm_flags & VM_WRITE)) return pte; if (vma->vm_flags & VM_SHADOW_STACK) return pte_mkwrite_shstk(pte); return pte_mkwrite(pte); } pte_t pte_mkwrite_vma(pte_t pte, struct vm_area_struct *vma) { if (vma->vm_flags & VM_SHADOW_STACK) return pte_mkwrite_shstk(pte); return pte_mkwrite(pte); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) { if (!(vma->vm_flags & VM_WRITE)) return pmd; if (vma->vm_flags & VM_SHADOW_STACK) return pmd_mkwrite_shstk(pmd); return pmd_mkwrite(pmd); } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ All the other pte_mkdirty()s, etc remain the same. Previously, there was a suggestion to not override the maybe_mkwrite()'s and put the logic in core MM by having a generic version of pte_mkwrite_shstk() that does nothing. But given what we are trying to do with pte_mkwrite_vma() it seemed better to hide all the shadow stack PTE changes in arch code again. After the changes, the only shadow stack specific bits in core mm are the bit in GUP to require FOLL_FORCE, the memory accounting, and these warnings: https://lore.kernel.org/lkml/20230119212317.8324-26-rick.p.edgecombe@intel.com/
On 26.01.23 01:59, Edgecombe, Rick P wrote: > On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote: >> Thanks for your comments and ideas here, I'll give the: >> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte) >> ...solution a try. > > Well, it turns out there are some pte_mkwrite() callers in other arch's > that operate on kernel memory and don't have a VMA. So it needed a new Why not pass in NULL as VMA then and document the semantics? The less similarly named but slightly different functions, the better :)
On 25.01.23 19:43, Edgecombe, Rick P wrote: > On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote: >>>> Roughly speaking: if we abstract it that way and get all of the >>>> "how >>>> to >>>> set it writable now?" out of core-MM, it not only is cleaner and >>>> less >>>> error prone, it might even allow other architectures that >>>> implement >>>> something comparable (e.g., using a dedicated HW bit) to actually >>>> reuse >>>> some of that work. Otherwise most of that "shstk" is really just >>>> x86 >>>> specific ... >>>> >>>> I guess the only cases we have to special case would be page >>>> pinning >>>> code where pte_write() would indicate that the PTE is writable >>>> (well, >>>> it >>>> is, just not by "ordinary CPU instruction" context directly): but >>>> you >>>> do >>>> that already, so ... :) >>>> >>>> Sorry for stumbling over that this late, I only started looking >>>> into >>>> this when you CCed me on that one patch. >>> >>> Sorry for not calling more attention to it earlier. Appreciate your >>> comments. >>> >>> Previously versions of this series had changed some of these >>> pte_mkwrite() calls to maybe_mkwrite(), which of course takes a >>> vma. >>> This way an x86 implementation could use the VM_SHADOW_STACK vma >>> flag >>> to decide between pte_mkwrite() and pte_mkwrite_shstk(). The >>> feedback >>> was that in some of these code paths "maybe" isn't really an >>> option, it >>> *needs* to make it writable. Even though the logic was the same, >>> the >>> name of the function made it look wrong. >>> >>> But another option could be to change pte_mkwrite() to take a vma. >>> This >>> would save using another software bit on x86, but instead requires >>> a >>> small change to each arch's pte_mkwrite(). >> >> I played with that idea shortly as well, but discarded it. I was not >> able to convince myself that it wouldn't be required to pass in the >> VMA >> as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ... >> which would end up fairly ugly (or even impossible in thing slike >> GUP-fast). >> >> For example, I wonder how we'd be handling stuff like do_numa_page() >> cleanly correctly, where we use pte_modify() + pte_mkwrite(), and >> either >> call might set the PTE writable and maintain dirty bit ... > > pte_modify() is handled like this currently: > > https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/ > > There has been a couple iterations on that. The current solution is to > do the Dirty->SavedDirty fixup if needed after the new prots are added. > > Of course pte_modify() can't know whether you are are attempting to > create a shadow stack PTE with the prot you are passing in. But the > callers today explicitly call pte_mkwrite() after filling in the other > bits with pte_modify(). See below on my MAP_PRIVATE vs. MAP_SHARED comment. > Today this patch causes the pte_mkwrite() to be > skipped and another fault may be required in the mprotect() and numa > cases, but if we change pte_mkwrite() to take a VMA we can just make it > shadow stack to start. > > It might be worth mentioning, there was a suggestion in the past to try > to have the shadow stack bits come out of vm_get_page_prot(), but MM > code would then try to map the zero page as (shadow stack) writable > when there was a normal (non-shadow stack) read access. So I had to > abandon that approach and rely on explicit calls to pte_mkwrite/shstk() > to make it shadow stack. Thanks, do you have a pointer? > >> >> Having that said, maybe it could work with only a single saved-dirty >> bit >> and passing in the VMA for pte_mkwrite() only. >> >> pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty >> bit >> to the soft-dirty bit instead, resulting in >> "writable=0,dirty=0,saved-dirty=1", >> >> pte_dirty() would return dirty==1||saved-dirty==1. >> >> pte_mkdirty() would set either set dirty=1 or saved-dirty=1, >> depending >> on the writable bit. >> >> pte_mkclean() would clean both bits. >> >> pte_write() would detect "writable == 1 || (writable==0 && dirty==1)" >> >> pte_mkwrite() would act according to the VMA, and in addition, merge >> the >> saved-dirty bit into the dirty bit. >> >> pte_modify() and mk_pte() .... would require more thought ... > > Not sure I'm following what the mk_pte() problem would be. You mean if > Write=0,Dirty=1 is manually added to the prot? > > Shouldn't people generally use the pte_mkwrite() helpers unless they > are drawing from a prot that was already created with the helpers or > vm_get_page_prot()? pte_mkwrite() is mostly only used (except for writenotify ...) for MAP_PRIVATE memory ("COW-able"). For MAP_SHARED memory, vma->vm_page_prot in a VM_WRITE mapping already contains the write permissions. pte_mkwrite() is not necessary (again, unless writenotify is active). I assume shstk VMAs don't apply to MAP_SHARED VMAs, which is why you didn't stumble over that issue yet? Because I don't see how it could work with MAP_SHARED VMAs. The other thing I had in mind was that we have to make sure that we're not accidentally setting "Write=0,Dirty=1" in mk_pte() / pte_modify(). Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect using pte_modify(), we have to make sure to move the dirty bit to the saved_dirty bit. > I think they can't manually create prot's from bits > in core mm code, right? And x86 arch code already has to be aware of > shadow stack. It's a bit of an assumption I guess, but I think maybe > not too crazy of one? I think that's true. Arch code is supposed to deal with that IIRC. > >> >> >> Further, ptep_modify_prot_commit() might have to be adjusted to >> properly >> flush in all relevant cases IIRC. > > Sorry, I'm not following. Can you elaborate? There is an adjustment > made in pte_flags_need_flush(). Note that I did not fully review all bits of this patch set, just throwing out what was on my mind. If already handled, great. > >> >>> >>> x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(), >>> but >>> maybe it could additionally warn if the vma is not writable. It >>> also >>> seems more aligned with your changes to stop taking hints from PTE >>> bits >>> and just look at the VMA? (I'm thinking about the dropping of the >>> dirty >>> check in GUP and dropping pte_saved_write()) >> >> The soft-shstk bit wouldn't be a hint, it would be logically >> changing >> the "type" of the PTE such that any other PTE functions can do the >> right >> thing without having to consume the VMA. > > Yea, true. > > Thanks for your comments and ideas here, I'll give the: > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte) > ...solution a try. Good!
On Thu, 2023-01-26 at 09:57 +0100, David Hildenbrand wrote: > On 25.01.23 19:43, Edgecombe, Rick P wrote: > > On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote: > > > > > Roughly speaking: if we abstract it that way and get all of > > > > > the > > > > > "how > > > > > to > > > > > set it writable now?" out of core-MM, it not only is cleaner > > > > > and > > > > > less > > > > > error prone, it might even allow other architectures that > > > > > implement > > > > > something comparable (e.g., using a dedicated HW bit) to > > > > > actually > > > > > reuse > > > > > some of that work. Otherwise most of that "shstk" is really > > > > > just > > > > > x86 > > > > > specific ... > > > > > > > > > > I guess the only cases we have to special case would be page > > > > > pinning > > > > > code where pte_write() would indicate that the PTE is > > > > > writable > > > > > (well, > > > > > it > > > > > is, just not by "ordinary CPU instruction" context directly): > > > > > but > > > > > you > > > > > do > > > > > that already, so ... :) > > > > > > > > > > Sorry for stumbling over that this late, I only started > > > > > looking > > > > > into > > > > > this when you CCed me on that one patch. > > > > > > > > Sorry for not calling more attention to it earlier. Appreciate > > > > your > > > > comments. > > > > > > > > Previously versions of this series had changed some of these > > > > pte_mkwrite() calls to maybe_mkwrite(), which of course takes a > > > > vma. > > > > This way an x86 implementation could use the VM_SHADOW_STACK > > > > vma > > > > flag > > > > to decide between pte_mkwrite() and pte_mkwrite_shstk(). The > > > > feedback > > > > was that in some of these code paths "maybe" isn't really an > > > > option, it > > > > *needs* to make it writable. Even though the logic was the > > > > same, > > > > the > > > > name of the function made it look wrong. > > > > > > > > But another option could be to change pte_mkwrite() to take a > > > > vma. > > > > This > > > > would save using another software bit on x86, but instead > > > > requires > > > > a > > > > small change to each arch's pte_mkwrite(). > > > > > > I played with that idea shortly as well, but discarded it. I was > > > not > > > able to convince myself that it wouldn't be required to pass in > > > the > > > VMA > > > as well for things like pte_dirty(), pte_mkdirty(), pte_write(), > > > ... > > > which would end up fairly ugly (or even impossible in thing slike > > > GUP-fast). > > > > > > For example, I wonder how we'd be handling stuff like > > > do_numa_page() > > > cleanly correctly, where we use pte_modify() + pte_mkwrite(), and > > > either > > > call might set the PTE writable and maintain dirty bit ... > > > > pte_modify() is handled like this currently: > > > > https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/ > > > > There has been a couple iterations on that. The current solution is > > to > > do the Dirty->SavedDirty fixup if needed after the new prots are > > added. > > > > Of course pte_modify() can't know whether you are are attempting to > > create a shadow stack PTE with the prot you are passing in. But the > > callers today explicitly call pte_mkwrite() after filling in the > > other > > bits with pte_modify(). > > See below on my MAP_PRIVATE vs. MAP_SHARED comment. Yep, MAP_SHARED support was dropped with the reboot of the series. It did have some problems IIRC. Now shadow stack memory creation is tightly controlled. Either created via special syscall or automatically with a new thread. > > > Today this patch causes the pte_mkwrite() to be > > skipped and another fault may be required in the mprotect() and > > numa > > cases, but if we change pte_mkwrite() to take a VMA we can just > > make it > > shadow stack to start. > > > > It might be worth mentioning, there was a suggestion in the past to > > try > > to have the shadow stack bits come out of vm_get_page_prot(), but > > MM > > code would then try to map the zero page as (shadow stack) writable > > when there was a normal (non-shadow stack) read access. So I had to > > abandon that approach and rely on explicit calls to > > pte_mkwrite/shstk() > > to make it shadow stack. > > Thanks, do you have a pointer? I never posted it because it didn't work out. This was the comment that prompted the exploration in that direction: https://lore.kernel.org/lkml/8065c333-0911-04a2-f91e-7c2e0cc7ec51@intel.com/ Shadow stack memory also used to not be VM_WRITE (VM_SHADOW_STACK only), but this was changed for other reasons. In v2 there were some updates to how shadow stack memory was handled, and the cover letter had a writeup of the reasons and general design: https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/ > > > > > > > > > Having that said, maybe it could work with only a single saved- > > > dirty > > > bit > > > and passing in the VMA for pte_mkwrite() only. > > > > > > pte_wrprotect() would detect "writable=0,dirty=1" and move the > > > dirty > > > bit > > > to the soft-dirty bit instead, resulting in > > > "writable=0,dirty=0,saved-dirty=1", > > > > > > pte_dirty() would return dirty==1||saved-dirty==1. > > > > > > pte_mkdirty() would set either set dirty=1 or saved-dirty=1, > > > depending > > > on the writable bit. > > > > > > pte_mkclean() would clean both bits. > > > > > > pte_write() would detect "writable == 1 || (writable==0 && > > > dirty==1)" > > > > > > pte_mkwrite() would act according to the VMA, and in addition, > > > merge > > > the > > > saved-dirty bit into the dirty bit. > > > > > > pte_modify() and mk_pte() .... would require more thought ... > > > > Not sure I'm following what the mk_pte() problem would be. You mean > > if > > Write=0,Dirty=1 is manually added to the prot? > > > > Shouldn't people generally use the pte_mkwrite() helpers unless > > they > > are drawing from a prot that was already created with the helpers > > or > > vm_get_page_prot()? > > pte_mkwrite() is mostly only used (except for writenotify ...) for > MAP_PRIVATE memory ("COW-able"). For MAP_SHARED memory, > vma->vm_page_prot in a VM_WRITE mapping already contains the write > permissions. pte_mkwrite() is not necessary (again, unless > writenotify > is active). Oh, interesting. > > I assume shstk VMAs don't apply to MAP_SHARED VMAs, which is why you > didn't stumble over that issue yet? Because I don't see how it could > work with MAP_SHARED VMAs. Yep, it doesn't support MAP_SHARED. > > > The other thing I had in mind was that we have to make sure that > we're > not accidentally setting "Write=0,Dirty=1" in mk_pte() / > pte_modify(). > > Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect > using pte_modify(), we have to make sure to move the dirty bit to > the > saved_dirty bit. For the mk_pte() case, I don't think a Write=0,Dirty=1 prot could come from anywhere. I guess the MAP_SHARED case is a little less bounded. We could maybe add a warning for this case. For the pte_modify() case, this does happen. There are two scenarios considered: 1. A Write=0,Dirty=0 PTE is made dirty. This can't happen today as Dirty is filtered via _PAGE_CHG_MASK. Basically pte_modify() doesn't support it. 2. A Write=1,Dirty=1 PTE gets write protected. This does happen because the Write=0 prot comes from protection_map, and pte_modify() would leave the Dirty=1 bit alone. The main case I know of is mprotect(). It is handled by changes to pte_modify() by doing the Dirty->SoftDirty fixup if needed. So pte_modify()s job should not be too tricky. What you can't do with it though, is create shadow stack PTEs. But it is ok for our uses because of the explicit mkwrite(). > > > I think they can't manually create prot's from bits > > in core mm code, right? And x86 arch code already has to be aware > > of > > shadow stack. It's a bit of an assumption I guess, but I think > > maybe > > not too crazy of one? > > I think that's true. Arch code is supposed to deal with that IIRC. > > > > > > > > > > > > Further, ptep_modify_prot_commit() might have to be adjusted to > > > properly > > > flush in all relevant cases IIRC. > > > > Sorry, I'm not following. Can you elaborate? There is an adjustment > > made in pte_flags_need_flush(). > > Note that I did not fully review all bits of this patch set, just > throwing out what was on my mind. If already handled, great. > > > > > > > > > > > > > > x86's pte_mkwrite() would then be pretty close to > > > > maybe_mkwrite(), > > > > but > > > > maybe it could additionally warn if the vma is not writable. It > > > > also > > > > seems more aligned with your changes to stop taking hints from > > > > PTE > > > > bits > > > > and just look at the VMA? (I'm thinking about the dropping of > > > > the > > > > dirty > > > > check in GUP and dropping pte_saved_write()) > > > > > > The soft-shstk bit wouldn't be a hint, it would be logically > > > changing > > > the "type" of the PTE such that any other PTE functions can do > > > the > > > right > > > thing without having to consume the VMA. > > > > Yea, true. > > > > Thanks for your comments and ideas here, I'll give the: > > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte) > > ...solution a try. > > Good! >
On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote: > On 26.01.23 01:59, Edgecombe, Rick P wrote: > > On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote: > > > Thanks for your comments and ideas here, I'll give the: > > > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte) > > > ...solution a try. > > > > Well, it turns out there are some pte_mkwrite() callers in other > > arch's > > that operate on kernel memory and don't have a VMA. So it needed a > > new > > Why not pass in NULL as VMA then and document the semantics? The > less > similarly named but slightly different functions, the better :) Hmm. The x86 and generic versions should probably have the same semantics, so then if you pass a NULL, it would do a regular pte_mkwrite() I guess? I see another benefit of requiring the vma argument, such that raw pte_mkwrite()s are less likely to appear in core MM code. But I think the NULL is awkward because it's not obvious, to me at least, what the implications of that should be. So it will be confusing to read in the NULL cases for the other archs. We also have some warnings to catch miss cases in the PTE tear down code, so the scenario of new code accidentally marking shadow stack PTEs as writable is not totally unchecked. The three functions that do slightly different things are: pte_mkwrite(): Makes a PTE conventionally writable, only takes a PTE. Very clear that it is a low level helper and what it does. maybe_mkwrite(): Might make a PTE writable if the VMA allows it. pte_mkwrite_vma(): Makes a PTE writable in a specific way depending on the VMA I wonder if the name pte_mkwrite_vma() is maybe just not clear enough. It takes a VMA, yes, but what does it do with it? What if it was called pte_mkwrite_type() instead? Some arch's have additional types of writable memory and this function creates them. Of course they also have the normal type of writable memory, and pte_mkwrite() creates that like usual. Doesn't it seem more readable?
On 26.01.23 21:19, Edgecombe, Rick P wrote: > On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote: >> On 26.01.23 01:59, Edgecombe, Rick P wrote: >>> On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote: >>>> Thanks for your comments and ideas here, I'll give the: >>>> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte) >>>> ...solution a try. >>> >>> Well, it turns out there are some pte_mkwrite() callers in other >>> arch's >>> that operate on kernel memory and don't have a VMA. So it needed a >>> new >> >> Why not pass in NULL as VMA then and document the semantics? The >> less >> similarly named but slightly different functions, the better :) > > Hmm. The x86 and generic versions should probably have the same > semantics, so then if you pass a NULL, it would do a regular > pte_mkwrite() I guess? > > I see another benefit of requiring the vma argument, such that raw > pte_mkwrite()s are less likely to appear in core MM code. But I think > the NULL is awkward because it's not obvious, to me at least, what the > implications of that should be. > > So it will be confusing to read in the NULL cases for the other archs. > We also have some warnings to catch miss cases in the PTE tear down > code, so the scenario of new code accidentally marking shadow stack > PTEs as writable is not totally unchecked. > > The three functions that do slightly different things are: > > pte_mkwrite(): > Makes a PTE conventionally writable, only takes a PTE. Very clear that > it is a low level helper and what it does. > > maybe_mkwrite(): > Might make a PTE writable if the VMA allows it. > > pte_mkwrite_vma(): > Makes a PTE writable in a specific way depending on the VMA > > I wonder if the name pte_mkwrite_vma() is maybe just not clear enough. > It takes a VMA, yes, but what does it do with it? > > What if it was called pte_mkwrite_type() instead? Some arch's have > additional types of writable memory and this function creates them. Of > course they also have the normal type of writable memory, and > pte_mkwrite() creates that like usual. Doesn't it seem more readable? The issue is, the more variants we provide the easier it is to make mistakes and introduce new buggy code. It's tempting to simply use pte_mkwrite() and call it a day, where people actually should use pte_mkwrite_vma(). Then, they at least have to investigate what to do about the second VMA parameter.
> > Now shadow stack memory creation is tightly controlled. Either created > via special syscall or automatically with a new thread. Good, it would be valuable to document that somewhere ("Neve rapplies to VM_SHARED|VM_MAYSHARE VMAs"). [...] >> >> The other thing I had in mind was that we have to make sure that >> we're >> not accidentally setting "Write=0,Dirty=1" in mk_pte() / >> pte_modify(). >> >> Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect >> using pte_modify(), we have to make sure to move the dirty bit to >> the >> saved_dirty bit. > > For the mk_pte() case, I don't think a Write=0,Dirty=1 prot could come > from anywhere. I guess the MAP_SHARED case is a little less bounded. We > could maybe add a warning for this case. Right, Write=0,Dirty=1 shouldn't apply at that point if shstk are always wrprotected as default. > > For the pte_modify() case, this does happen. There are two scenarios > considered: > 1. A Write=0,Dirty=0 PTE is made dirty. This can't happen today as > Dirty is filtered via _PAGE_CHG_MASK. Basically pte_modify() doesn't > support it. It should simply set the saved_dirty bit I guess. But I don't think pte_modify() is actually supposed to set PTEs dirty (primary goal is to change protection IIRC). > 2. A Write=1,Dirty=1 PTE gets write protected. This does happen because > the Write=0 prot comes from protection_map, and pte_modify() would > leave the Dirty=1 bit alone. The main case I know of is mprotect(). It > is handled by changes to pte_modify() by doing the Dirty->SoftDirty > fixup if needed. Right, we'd have to move the dirty bit to the saved_dirty bit. (we have to handle soft-dirty, too, whenever setting the PTE dirty -- either via the dirty bit or via the saved_dirty bit) > > So pte_modify()s job should not be too tricky. What you can't do with > it though, is create shadow stack PTEs. But it is ok for our uses > because of the explicit mkwrite(). I think you are correct.
On Fri, 2023-01-27 at 17:12 +0100, David Hildenbrand wrote: > On 26.01.23 21:19, Edgecombe, Rick P wrote: > > On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote: > > > On 26.01.23 01:59, Edgecombe, Rick P wrote: > > > > On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote: > > > > > Thanks for your comments and ideas here, I'll give the: > > > > > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte) > > > > > ...solution a try. > > > > > > > > Well, it turns out there are some pte_mkwrite() callers in > > > > other > > > > arch's > > > > that operate on kernel memory and don't have a VMA. So it > > > > needed a > > > > new > > > > > > Why not pass in NULL as VMA then and document the semantics? The > > > less > > > similarly named but slightly different functions, the better :) > > > > Hmm. The x86 and generic versions should probably have the same > > semantics, so then if you pass a NULL, it would do a regular > > pte_mkwrite() I guess? > > > > I see another benefit of requiring the vma argument, such that raw > > pte_mkwrite()s are less likely to appear in core MM code. But I > > think > > the NULL is awkward because it's not obvious, to me at least, what > > the > > implications of that should be. > > > > So it will be confusing to read in the NULL cases for the other > > archs. > > We also have some warnings to catch miss cases in the PTE tear down > > code, so the scenario of new code accidentally marking shadow stack > > PTEs as writable is not totally unchecked. > > > > The three functions that do slightly different things are: > > > > pte_mkwrite(): > > Makes a PTE conventionally writable, only takes a PTE. Very clear > > that > > it is a low level helper and what it does. > > > > maybe_mkwrite(): > > Might make a PTE writable if the VMA allows it. > > > > pte_mkwrite_vma(): > > Makes a PTE writable in a specific way depending on the VMA > > > > I wonder if the name pte_mkwrite_vma() is maybe just not clear > > enough. > > It takes a VMA, yes, but what does it do with it? > > > > What if it was called pte_mkwrite_type() instead? Some arch's have > > additional types of writable memory and this function creates them. > > Of > > course they also have the normal type of writable memory, and > > pte_mkwrite() creates that like usual. Doesn't it seem more > > readable? > > The issue is, the more variants we provide the easier it is to make > mistakes and introduce new buggy code. > > It's tempting to simply use pte_mkwrite() and call it a day, where > people actually should use pte_mkwrite_vma(). > > Then, they at least have to investigate what to do about the second > VMA > parameter. Ok, I'll give it a spin. So far it looks ok. The downside is the giant tree-wide pte_mkwrite() signature change, but once that is over with there are other advantages. Like getting rid of maybe_mkwrite()'s awareness of shadow stack so the logic is more centralized. Please let me know if you don't feel comfortable with a suggested-by credit tag. Thanks, Rick
On 28.01.23 01:51, Edgecombe, Rick P wrote: > On Fri, 2023-01-27 at 17:12 +0100, David Hildenbrand wrote: >> On 26.01.23 21:19, Edgecombe, Rick P wrote: >>> On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote: >>>> On 26.01.23 01:59, Edgecombe, Rick P wrote: >>>>> On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote: >>>>>> Thanks for your comments and ideas here, I'll give the: >>>>>> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte) >>>>>> ...solution a try. >>>>> >>>>> Well, it turns out there are some pte_mkwrite() callers in >>>>> other >>>>> arch's >>>>> that operate on kernel memory and don't have a VMA. So it >>>>> needed a >>>>> new >>>> >>>> Why not pass in NULL as VMA then and document the semantics? The >>>> less >>>> similarly named but slightly different functions, the better :) >>> >>> Hmm. The x86 and generic versions should probably have the same >>> semantics, so then if you pass a NULL, it would do a regular >>> pte_mkwrite() I guess? >>> >>> I see another benefit of requiring the vma argument, such that raw >>> pte_mkwrite()s are less likely to appear in core MM code. But I >>> think >>> the NULL is awkward because it's not obvious, to me at least, what >>> the >>> implications of that should be. >>> >>> So it will be confusing to read in the NULL cases for the other >>> archs. >>> We also have some warnings to catch miss cases in the PTE tear down >>> code, so the scenario of new code accidentally marking shadow stack >>> PTEs as writable is not totally unchecked. >>> >>> The three functions that do slightly different things are: >>> >>> pte_mkwrite(): >>> Makes a PTE conventionally writable, only takes a PTE. Very clear >>> that >>> it is a low level helper and what it does. >>> >>> maybe_mkwrite(): >>> Might make a PTE writable if the VMA allows it. >>> >>> pte_mkwrite_vma(): >>> Makes a PTE writable in a specific way depending on the VMA >>> >>> I wonder if the name pte_mkwrite_vma() is maybe just not clear >>> enough. >>> It takes a VMA, yes, but what does it do with it? >>> >>> What if it was called pte_mkwrite_type() instead? Some arch's have >>> additional types of writable memory and this function creates them. >>> Of >>> course they also have the normal type of writable memory, and >>> pte_mkwrite() creates that like usual. Doesn't it seem more >>> readable? >> >> The issue is, the more variants we provide the easier it is to make >> mistakes and introduce new buggy code. >> >> It's tempting to simply use pte_mkwrite() and call it a day, where >> people actually should use pte_mkwrite_vma(). >> >> Then, they at least have to investigate what to do about the second >> VMA >> parameter. > > Ok, I'll give it a spin. So far it looks ok. The downside is the giant > tree-wide pte_mkwrite() signature change, but once that is over with > there are other advantages. Like getting rid of maybe_mkwrite()'s > awareness of shadow stack so the logic is more centralized. Please let > me know if you don't feel comfortable with a suggested-by credit tag. Sure ... but I reconsidered :) Maybe there is a cleaner way to do it and avoid the "NULL" argument. What about having (while you're going over everything already): pte_mkwrite(pte, vma) pte_mkwrite_kernel(pte) The latter would only be used in that arch code where we're working on kernel pgtables. We already have pte_offset_kernel() and pte_alloc_kernel_track(), so it's not too weird.
On Tue, 2023-01-31 at 09:46 +0100, David Hildenbrand wrote: > Sure ... > > but I reconsidered :) > > Maybe there is a cleaner way to do it and avoid the "NULL" argument. > > What about having (while you're going over everything already): > > pte_mkwrite(pte, vma) > pte_mkwrite_kernel(pte) > > The latter would only be used in that arch code where we're working > on > kernel pgtables. We already have pte_offset_kernel() and > pte_alloc_kernel_track(), so it's not too weird. Hmm, one downside is the "mk" part might lead people to guess pte_mkwrite_kernel() would make it writable AND a kernel page (like U/S=0 on x86). Instead of being a mkwrite() that's useful for setting on kernel PTEs. The other problem is that one of NULL passers is not for kernel memory. huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't be created with MAP_HUGETLB, so it is not needed. Using pte_mkwrite_kernel() would look weird in this case, but making huge_pte_mkwrite() take a VMA would be for no reason. Maybe making huge_pte_mkwrite() take a VMA is the better of those two options. Or keep the NULL semantics... Any thoughts?
On 01.02.23 00:33, Edgecombe, Rick P wrote: > On Tue, 2023-01-31 at 09:46 +0100, David Hildenbrand wrote: >> Sure ... >> >> but I reconsidered :) >> >> Maybe there is a cleaner way to do it and avoid the "NULL" argument. >> >> What about having (while you're going over everything already): >> >> pte_mkwrite(pte, vma) >> pte_mkwrite_kernel(pte) >> >> The latter would only be used in that arch code where we're working >> on >> kernel pgtables. We already have pte_offset_kernel() and >> pte_alloc_kernel_track(), so it's not too weird. > > Hmm, one downside is the "mk" part might lead people to guess > pte_mkwrite_kernel() would make it writable AND a kernel page (like > U/S=0 on x86). Instead of being a mkwrite() that's useful for setting > on kernel PTEs. At least I wouldn't worry about that too much. We handle nowhere in common code user vs. supervisor access that way explicitly (e.g., mkkernel), and it wouldn't even apply on architectures where we cannot make such a decision on a per-PTE basis. > > The other problem is that one of NULL passers is not for kernel memory. > huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't be > created with MAP_HUGETLB, so it is not needed. Using > pte_mkwrite_kernel() would look weird in this case, but making > huge_pte_mkwrite() take a VMA would be for no reason. Maybe making > huge_pte_mkwrite() take a VMA is the better of those two options. Or > keep the NULL semantics... Any thoughts? Well, the reason would be consistency. From a core-mm point of view it makes sense to handle this all consistency, even if the single user (x86) wouldn't strictly require it right now. I'd just pass in the VMA and call it a day :)
On Wed, 2023-02-01 at 10:03 +0100, David Hildenbrand wrote: > > > > The other problem is that one of NULL passers is not for kernel > > memory. > > huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't > > be > > created with MAP_HUGETLB, so it is not needed. Using > > pte_mkwrite_kernel() would look weird in this case, but making > > huge_pte_mkwrite() take a VMA would be for no reason. Maybe making > > huge_pte_mkwrite() take a VMA is the better of those two options. > > Or > > keep the NULL semantics... Any thoughts? > > Well, the reason would be consistency. From a core-mm point of view > it > makes sense to handle this all consistency, even if the single user > (x86) wouldn't strictly require it right now. > > I'd just pass in the VMA and call it a day :) Ok, I'll give it a spin.
On 01.02.23 18:32, Edgecombe, Rick P wrote: > On Wed, 2023-02-01 at 10:03 +0100, David Hildenbrand wrote: >>> >>> The other problem is that one of NULL passers is not for kernel >>> memory. >>> huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't >>> be >>> created with MAP_HUGETLB, so it is not needed. Using >>> pte_mkwrite_kernel() would look weird in this case, but making >>> huge_pte_mkwrite() take a VMA would be for no reason. Maybe making >>> huge_pte_mkwrite() take a VMA is the better of those two options. >>> Or >>> keep the NULL semantics... Any thoughts? >> >> Well, the reason would be consistency. From a core-mm point of view >> it >> makes sense to handle this all consistency, even if the single user >> (x86) wouldn't strictly require it right now. >> >> I'd just pass in the VMA and call it a day :) > > Ok, I'll give it a spin. It would be good to get more opinions on that, but I'm afraid we won't get more deep down in this thread :)
diff --git a/include/linux/mm.h b/include/linux/mm.h index e15d2fc04007..139a682d243b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2181,7 +2181,7 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma */ if (vma->vm_flags & VM_SHARED) return vma_wants_writenotify(vma, vma->vm_page_prot); - return !!(vma->vm_flags & VM_WRITE); + return (vma->vm_flags & VM_WRITE) && !(vma->vm_flags & VM_SHADOW_STACK); } bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,