Message ID | 20230515130553.2311248-1-jeffxu@chromium.org (mailing list archive) |
---|---|
Headers | show |
Series | Memory Mapping (VMA) protection using PKU - set 1 | expand |
On 5/15/23 06:05, jeffxu@chromium.org wrote: > We're using PKU for in-process isolation to enforce control-flow integrity > for a JIT compiler. In our threat model, an attacker exploits a > vulnerability and has arbitrary read/write access to the whole process > space concurrently to other threads being executed. This attacker can > manipulate some arguments to syscalls from some threads. This all sounds like it hinges on the contents of PKRU in the attacker thread. Could you talk a bit about how the attacker is prevented from running WRPKRU, XRSTOR or compelling the kernel to write to PKRU like at sigreturn?
On Mon, May 15, 2023 at 4:28 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 5/15/23 06:05, jeffxu@chromium.org wrote: > > We're using PKU for in-process isolation to enforce control-flow integrity > > for a JIT compiler. In our threat model, an attacker exploits a > > vulnerability and has arbitrary read/write access to the whole process > > space concurrently to other threads being executed. This attacker can > > manipulate some arguments to syscalls from some threads. > > This all sounds like it hinges on the contents of PKRU in the attacker > thread. > > Could you talk a bit about how the attacker is prevented from running > WRPKRU, XRSTOR or compelling the kernel to write to PKRU like at sigreturn? (resending without html) Since we're using the feature for control-flow integrity, we assume the control-flow is still intact at this point. I.e. the attacker thread can't run arbitrary instructions. * For JIT code, we're going to scan it for wrpkru instructions before writing it to executable memory * For regular code, we only use wrpkru around short critical sections to temporarily enable write access Sigreturn is a separate problem that we hope to solve by adding pkey support to sigaltstack
On Mon, May 15, 2023 at 01:05:46PM +0000, jeffxu@chromium.org wrote: > This patch introduces a new flag, PKEY_ENFORCE_API, to the pkey_alloc() > function. When a PKEY is created with this flag, it is enforced that any > thread that wants to make changes to the memory mapping (such as mprotect) > of the memory must have write access to the PKEY. PKEYs created without > this flag will continue to work as they do now, for backwards > compatibility. > > Only PKEY created from user space can have the new flag set, the PKEY > allocated by the kernel internally will not have it. In other words, > ARCH_DEFAULT_PKEY(0) and execute_only_pkey won’t have this flag set, > and continue work as today. Cool! Yeah, this looks like it could become quite useful. I assume V8 folks are on board with this API, etc? > This set of patch covers mprotect/munmap, I plan to work on other > syscalls after this. Which ones are on your list currently?
On Tue, May 16, 2023 at 1:08 PM Kees Cook <keescook@chromium.org> wrote: > > On Mon, May 15, 2023 at 01:05:46PM +0000, jeffxu@chromium.org wrote: > > This patch introduces a new flag, PKEY_ENFORCE_API, to the pkey_alloc() > > function. When a PKEY is created with this flag, it is enforced that any > > thread that wants to make changes to the memory mapping (such as mprotect) > > of the memory must have write access to the PKEY. PKEYs created without > > this flag will continue to work as they do now, for backwards > > compatibility. > > > > Only PKEY created from user space can have the new flag set, the PKEY > > allocated by the kernel internally will not have it. In other words, > > ARCH_DEFAULT_PKEY(0) and execute_only_pkey won’t have this flag set, > > and continue work as today. > > Cool! Yeah, this looks like it could become quite useful. I assume > V8 folks are on board with this API, etc? > > > This set of patch covers mprotect/munmap, I plan to work on other > > syscalls after this. > > Which ones are on your list currently? > mprotect/mprotect_pkey/munmap mmap/mremap madvice,brk,sbrk Thanks! -Jeff Xu > -- > Kees Cook
On 5/16/23 15:17, Jeff Xu wrote: >>> This set of patch covers mprotect/munmap, I plan to work on other >>> syscalls after this. >> Which ones are on your list currently? >> > mprotect/mprotect_pkey/munmap > mmap/mremap > madvice,brk,sbrk What about pkey_free()? Without that, someone can presumably free the pkey and then reallocate it without PKEY_ENFORCE_API.
On 5/16/23 00:06, Stephen Röttger wrote: > On Mon, May 15, 2023 at 4:28 PM Dave Hansen <dave.hansen@intel.com> wrote: >> >> On 5/15/23 06:05, jeffxu@chromium.org wrote: >>> We're using PKU for in-process isolation to enforce control-flow integrity >>> for a JIT compiler. In our threat model, an attacker exploits a >>> vulnerability and has arbitrary read/write access to the whole process >>> space concurrently to other threads being executed. This attacker can >>> manipulate some arguments to syscalls from some threads. >> >> This all sounds like it hinges on the contents of PKRU in the attacker >> thread. >> >> Could you talk a bit about how the attacker is prevented from running >> WRPKRU, XRSTOR or compelling the kernel to write to PKRU like at sigreturn? > > (resending without html) > > Since we're using the feature for control-flow integrity, we assume > the control-flow is still intact at this point. I.e. the attacker > thread can't run arbitrary instructions. Can't run arbitrary instructions, but can make (pretty) arbitrary syscalls? > * For JIT code, we're going to scan it for wrpkru instructions before > writing it to executable memory ... and XRSTOR, right? > * For regular code, we only use wrpkru around short critical sections > to temporarily enable write access > > Sigreturn is a separate problem that we hope to solve by adding pkey > support to sigaltstack What kind of support were you planning to add? I was thinking that an attacker with arbitrary write access would wait until PKRU was on the userspace stack and *JUST* before the kernel sigreturn code restores it to write a malicious value. It could presumably do this with some asynchronous mechanism so that even if there was only one attacker thread, it could change its own value. Also, the kernel side respect for PKRU is ... well ... rather weak. It's a best effort and if we *happen* to be in a kernel context where PKRU is relevant, we can try to respect PKRU. But there are a whole bunch of things like get_user_pages_remote() that just plain don't have PKRU available and can't respect it at all. I think io_uring also greatly expanded how common "remote" access to process memory is. So, overall, I'm thrilled to see another potential user for pkeys. It sounds like there's an actual user lined up here, which would be wonderful. But, I also want to make sure we don't go to the trouble to build something that doesn't actually present meaningful, durable obstacles to an attacker. I also haven't more than glanced at the code.
On Tue, May 16, 2023 at 3:30 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 5/16/23 15:17, Jeff Xu wrote: > >>> This set of patch covers mprotect/munmap, I plan to work on other > >>> syscalls after this. > >> Which ones are on your list currently? > >> > > mprotect/mprotect_pkey/munmap > > mmap/mremap > > madvice,brk,sbrk > > What about pkey_free()? > > Without that, someone can presumably free the pkey and then reallocate > it without PKEY_ENFORCE_API. > Great catch. I will add it to the list. Thanks! -Jeff Xu >
On Tue, May 16, 2023 at 10:08 PM Kees Cook <keescook@chromium.org> wrote: > > On Mon, May 15, 2023 at 01:05:46PM +0000, jeffxu@chromium.org wrote: > > This patch introduces a new flag, PKEY_ENFORCE_API, to the pkey_alloc() > > function. When a PKEY is created with this flag, it is enforced that any > > thread that wants to make changes to the memory mapping (such as mprotect) > > of the memory must have write access to the PKEY. PKEYs created without > > this flag will continue to work as they do now, for backwards > > compatibility. > > > > Only PKEY created from user space can have the new flag set, the PKEY > > allocated by the kernel internally will not have it. In other words, > > ARCH_DEFAULT_PKEY(0) and execute_only_pkey won’t have this flag set, > > and continue work as today. > > Cool! Yeah, this looks like it could become quite useful. I assume > V8 folks are on board with this API, etc? Yes! (I'm from the v8 team driving the implementation on v8 side) > > This set of patch covers mprotect/munmap, I plan to work on other > > syscalls after this. > > Which ones are on your list currently? > > -- > Kees Cook
On Wed, May 17, 2023 at 12:41 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 5/16/23 00:06, Stephen Röttger wrote: > > On Mon, May 15, 2023 at 4:28 PM Dave Hansen <dave.hansen@intel.com> wrote: > >> > >> On 5/15/23 06:05, jeffxu@chromium.org wrote: > >>> We're using PKU for in-process isolation to enforce control-flow integrity > >>> for a JIT compiler. In our threat model, an attacker exploits a > >>> vulnerability and has arbitrary read/write access to the whole process > >>> space concurrently to other threads being executed. This attacker can > >>> manipulate some arguments to syscalls from some threads. > >> > >> This all sounds like it hinges on the contents of PKRU in the attacker > >> thread. > >> > >> Could you talk a bit about how the attacker is prevented from running > >> WRPKRU, XRSTOR or compelling the kernel to write to PKRU like at sigreturn? > > > > (resending without html) > > > > Since we're using the feature for control-flow integrity, we assume > > the control-flow is still intact at this point. I.e. the attacker > > thread can't run arbitrary instructions. > > Can't run arbitrary instructions, but can make (pretty) arbitrary syscalls? The threat model is that the attacker has arbitrary read/write, while other threads run in parallel. So whenever a regular thread performs a syscall and takes a syscall argument from memory, we assume that argument can be attacker controlled. Unfortunately, the line is a bit blurry which syscalls / syscall arguments we need to assume to be attacker controlled. We're trying to approach this by roughly categorizing syscalls+args: * how commonly used is the syscall * do we expect the argument to be taken from writable memory * can we restrict the syscall+args with seccomp * how difficult is it to restrict the syscall in userspace vs kernel * does the syscall affect our protections (e.g. change control-flow or pkey) Using munmap as an example: * it's a very common syscall (nearly every seccomp filter will allow munmap) * the addr argument will come from memory * unmapping pkey-tagged pages breaks our assumptions * it's hard to restrict in userspace since we'd need to keep track of all address ranges that are unsafe to unmap and hook the syscall to perform the validation on every call in the codebase. * it's easy to validate in kernel with this patch For most other syscalls, they either don't affect the control-flow, are easy to avoid and block with seccomp or we can add validation in userspace (e.g. only install signal handlers at program startup). > > * For JIT code, we're going to scan it for wrpkru instructions before > > writing it to executable memory > > ... and XRSTOR, right? Right. We’ll just have a list of allowed instructions that the JIT compiler can emit. > > > * For regular code, we only use wrpkru around short critical sections > > to temporarily enable write access > > > > Sigreturn is a separate problem that we hope to solve by adding pkey > > support to sigaltstack > > What kind of support were you planning to add? We’d like to allow registering pkey-tagged memory as a sigaltstack. This would allow the signal handler to run isolated from other threads. Right now, the main reason this doesn’t work is that the kernel would need to change the pkru state before storing the register state on the stack. > I was thinking that an attacker with arbitrary write access would wait > until PKRU was on the userspace stack and *JUST* before the kernel > sigreturn code restores it to write a malicious value. It could > presumably do this with some asynchronous mechanism so that even if > there was only one attacker thread, it could change its own value. I’m not sure I follow the details, can you give an example of an asynchronous mechanism to do this? E.g. would this be the kernel writing to the memory in a syscall for example? > Also, the kernel side respect for PKRU is ... well ... rather weak. > It's a best effort and if we *happen* to be in a kernel context where > PKRU is relevant, we can try to respect PKRU. But there are a whole > bunch of things like get_user_pages_remote() that just plain don't have > PKRU available and can't respect it at all. > > I think io_uring also greatly expanded how common "remote" access to > process memory is. > > So, overall, I'm thrilled to see another potential user for pkeys. It > sounds like there's an actual user lined up here, which would be > wonderful. But, I also want to make sure we don't go to the trouble to > build something that doesn't actually present meaningful, durable > obstacles to an attacker. > > I also haven't more than glanced at the code.
On 5/17/23 03:51, Stephen Röttger wrote: > On Wed, May 17, 2023 at 12:41 AM Dave Hansen <dave.hansen@intel.com> wrote: >> Can't run arbitrary instructions, but can make (pretty) arbitrary syscalls? > > The threat model is that the attacker has arbitrary read/write, while other > threads run in parallel. So whenever a regular thread performs a syscall and > takes a syscall argument from memory, we assume that argument can be attacker > controlled. > Unfortunately, the line is a bit blurry which syscalls / syscall arguments we > need to assume to be attacker controlled. Ahh, OK. So, it's not that the *attacker* can make arbitrary syscalls. It's that the attacker might leverage its arbitrary write to trick a victim thread into turning what would otherwise be a good syscall into a bad one with attacker-controlled content. I guess that makes the readv/writev-style of things a bad idea in this environment. >>> Sigreturn is a separate problem that we hope to solve by adding pkey >>> support to sigaltstack >> >> What kind of support were you planning to add? > > We’d like to allow registering pkey-tagged memory as a sigaltstack. This would > allow the signal handler to run isolated from other threads. Right now, the > main reason this doesn’t work is that the kernel would need to change the pkru > state before storing the register state on the stack. > >> I was thinking that an attacker with arbitrary write access would wait >> until PKRU was on the userspace stack and *JUST* before the kernel >> sigreturn code restores it to write a malicious value. It could >> presumably do this with some asynchronous mechanism so that even if >> there was only one attacker thread, it could change its own value. > > I’m not sure I follow the details, can you give an example of an asynchronous > mechanism to do this? E.g. would this be the kernel writing to the memory in a > syscall for example? I was thinking of all of the IORING_OP_*'s that can write to memory or aio(7).
On Wed, May 17, 2023 at 8:07 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 5/17/23 03:51, Stephen Röttger wrote: > > On Wed, May 17, 2023 at 12:41 AM Dave Hansen <dave.hansen@intel.com> wrote: > >> Can't run arbitrary instructions, but can make (pretty) arbitrary syscalls? > > > > The threat model is that the attacker has arbitrary read/write, while other > > threads run in parallel. So whenever a regular thread performs a syscall and > > takes a syscall argument from memory, we assume that argument can be attacker > > controlled. > > Unfortunately, the line is a bit blurry which syscalls / syscall arguments we > > need to assume to be attacker controlled. > > Ahh, OK. So, it's not that the *attacker* can make arbitrary syscalls. > It's that the attacker might leverage its arbitrary write to trick a > victim thread into turning what would otherwise be a good syscall into a > bad one with attacker-controlled content. > > I guess that makes the readv/writev-style of things a bad idea in this > environment. > > >>> Sigreturn is a separate problem that we hope to solve by adding pkey > >>> support to sigaltstack > >> > >> What kind of support were you planning to add? > > > > We’d like to allow registering pkey-tagged memory as a sigaltstack. This would > > allow the signal handler to run isolated from other threads. Right now, the > > main reason this doesn’t work is that the kernel would need to change the pkru > > state before storing the register state on the stack. > > > >> I was thinking that an attacker with arbitrary write access would wait > >> until PKRU was on the userspace stack and *JUST* before the kernel > >> sigreturn code restores it to write a malicious value. It could > >> presumably do this with some asynchronous mechanism so that even if > >> there was only one attacker thread, it could change its own value. > > > > I’m not sure I follow the details, can you give an example of an asynchronous > > mechanism to do this? E.g. would this be the kernel writing to the memory in a > > syscall for example? > > I was thinking of all of the IORING_OP_*'s that can write to memory or > aio(7). IORING is challenging from security perspectives, for now, it is disabled in ChromeOS. Though I'm not sure how aio is related ? Thanks -Jeff
On 5/17/23 08:21, Jeff Xu wrote: >>> I’m not sure I follow the details, can you give an example of an asynchronous >>> mechanism to do this? E.g. would this be the kernel writing to the memory in a >>> syscall for example? >> I was thinking of all of the IORING_OP_*'s that can write to memory or >> aio(7). > IORING is challenging from security perspectives, for now, it is > disabled in ChromeOS. Though I'm not sure how aio is related ? Let's say you're the attacking thread and you're the *only* attacking thread. You have three things at your disposal: 1. A benign thread doing aio_read() 2. An arbitrary write primitive 3. You can send signals to yourself 4. You can calculate where your signal stack will be You calculate the address of PKRU on the future signal stack. You then leverage the otherwise benign aio_write() to write a 0 to that PKRU location. Then, send a signal to yourself. The attacker's PKRU value will be written to the stack. If you can time it right, the AIO will complete while the signal handler is in progress and PKRU is on the stack. On sigreturn, the kernel restores the aio_read()-placed, attacker-provided PKRU value. Now the attacker has PKRU==0. It effectively build a WRPKRU primitive out of those other pieces.
On Wed, May 17, 2023 at 8:29 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 5/17/23 08:21, Jeff Xu wrote: > >>> I’m not sure I follow the details, can you give an example of an asynchronous > >>> mechanism to do this? E.g. would this be the kernel writing to the memory in a > >>> syscall for example? > >> I was thinking of all of the IORING_OP_*'s that can write to memory or > >> aio(7). > > IORING is challenging from security perspectives, for now, it is > > disabled in ChromeOS. Though I'm not sure how aio is related ? > > Let's say you're the attacking thread and you're the *only* attacking > thread. You have three things at your disposal: > > 1. A benign thread doing aio_read() > 2. An arbitrary write primitive > 3. You can send signals to yourself > 4. You can calculate where your signal stack will be > > You calculate the address of PKRU on the future signal stack. You then > leverage the otherwise benign aio_write() to write a 0 to that PKRU > location. Then, send a signal to yourself. The attacker's PKRU value > will be written to the stack. If you can time it right, the AIO will > complete while the signal handler is in progress and PKRU is on the > stack. On sigreturn, the kernel restores the aio_read()-placed, > attacker-provided PKRU value. Now the attacker has PKRU==0. It > effectively build a WRPKRU primitive out of those other pieces. > > Ah, I understand the question now, thanks for the explanation. Signalling handling is the next project that I will be working on. I'm leaning towards saving PKRU register to the thread struct, similar to how context switch works. This will address the attack scenario you described. However, there are a few challenges I have not yet worked through. First, the code needs to track when the first signaling entry occurs (saving the PKRU register to the thread struct) and when it is last returned (restoring the PKRU register from the thread struct). One way to do this would be to add another member to the thread struct to track the level of signaling re-entry. Second, signal is used in error handling, including the kernel's own signaling handling code path. I haven't worked through this part of code logic completely. If the first approach is too complicated or considered intrusive, I could take a different approach. In this approach, I would not track signaling re-entry. Instead, I would modify the PKRU saved in AltStack during handling of the signal, the steps are: a> save PKRU to tmp variable. b> modify PKRU to allow writing to the PKEY protected AltStack c> XSAVE. d> write tmp to the memory address of PKRU in AltStack at the correct offset. Since the thread's PKRU is saved to stack, XRSTOR will restore the thread's original PKRU during sigreturn in normal situations. This approach might be a little hacky because it overwrites XSAVE results. If we go with this route, I need someone's help on the overwriting function, it is CPU specific. However this approach will not work if an attacker can install its own signaling handling (therefore gains the ability to overwrite PKRU stored in stack, as you described), the application will want to install all the signaling handling with PKEY protected AltStack at startup time, and disallow additional signaling handling after that, this is programmatically achievable in V8, as Stephan mentioned. I would appreciate getting more comments in the signaling handling area on those two approaches, or are there better ways to do what we want? Do you think we could continue signaling handling discussion from the original thread that Kees started [1] ? There were already lots of discussions there about signalling handling, so it will be easier for future readers to understand the context. I can repost there. Or I can start a new thread for signaling handling, I'm worried that those discussions will get lengthy and context get lost with patch version update. Although the signaling handling project is related, I think VMA protection using the PKRU project can stand on its own. We could solve this for V8 first then move next to Signaling handling, the work here could also pave the way to add mseal() in future, I expect lots of code logic will be similar. [1] https://lore.kernel.org/all/202208221331.71C50A6F@keescook/ Thanks! Best regards, -Jeff Xu On Wed, May 17, 2023 at 8:29 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 5/17/23 08:21, Jeff Xu wrote: > >>> I’m not sure I follow the details, can you give an example of an asynchronous > >>> mechanism to do this? E.g. would this be the kernel writing to the memory in a > >>> syscall for example? > >> I was thinking of all of the IORING_OP_*'s that can write to memory or > >> aio(7). > > IORING is challenging from security perspectives, for now, it is > > disabled in ChromeOS. Though I'm not sure how aio is related ? > > Let's say you're the attacking thread and you're the *only* attacking > thread. You have three things at your disposal: > > 1. A benign thread doing aio_read() > 2. An arbitrary write primitive > 3. You can send signals to yourself > 4. You can calculate where your signal stack will be > > You calculate the address of PKRU on the future signal stack. You then > leverage the otherwise benign aio_write() to write a 0 to that PKRU > location. Then, send a signal to yourself. The attacker's PKRU value > will be written to the stack. If you can time it right, the AIO will > complete while the signal handler is in progress and PKRU is on the > stack. On sigreturn, the kernel restores the aio_read()-placed, > attacker-provided PKRU value. Now the attacker has PKRU==0. It > effectively build a WRPKRU primitive out of those other pieces. > >
On 5/17/23 16:48, Jeff Xu wrote: > However, there are a few challenges I have not yet worked through. > First, the code needs to track when the first signaling entry occurs > (saving the PKRU register to the thread struct) and when it is last > returned (restoring the PKRU register from the thread struct). Would tracking signal "depth" work in the face of things like siglongjmp? Taking a step back... Here's my concern about this whole thing: it's headed down a rabbit hole which is *highly* specialized both in the apps that will use it and the attacks it will mitigate. It probably *requires* turning off a bunch of syscalls (like io_uring) that folks kinda like in general. We're balancing that highly specialized mitigation with a feature that add new ABI, touches core memory management code and signal handling. On the x86 side, PKRU is a painfully special snowflake. It's exposed in the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel. This would be making it an even more special snowflake because it would need new altstack ABI and handling. I'm just not sure the gain is worth the pain.
Hello Dave, Thanks for your email. On Thu, May 18, 2023 at 8:38 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 5/17/23 16:48, Jeff Xu wrote: > > However, there are a few challenges I have not yet worked through. > > First, the code needs to track when the first signaling entry occurs > > (saving the PKRU register to the thread struct) and when it is last > > returned (restoring the PKRU register from the thread struct). > > Would tracking signal "depth" work in the face of things like siglongjmp? > Thank you for your question! I am eager to learn more about this area and I worry about blind spots. I will investigate and get back to you. > Taking a step back... > > Here's my concern about this whole thing: it's headed down a rabbit hole > which is *highly* specialized both in the apps that will use it and the > attacks it will mitigate. It probably *requires* turning off a bunch of > syscalls (like io_uring) that folks kinda like in general. > ChromeOS currently disabled io_uring, but it is not required to do so. io_uring supports the IORING_OP_MADVICE operation, which calls the do_madvise() function. This means that io_uring will have the same pkey checks as the madvice() system call. From that perspective, we will fully support io_uring for this feature. > We're balancing that highly specialized mitigation with a feature that > add new ABI, touches core memory management code and signal handling. > The ABI change uses the existing flag field in pkey_alloc() which is reserved. The implementation is backward compatible with all existing pkey usages in both kernel and user space. Or do you have other concerns about ABI in mind ? Yes, you are right about the risk of touching core mm code. To minimize the risk, I try to control the scope of the change (it is about 3 lines in mprotect, more in munmap but really just 3 effective lines from syscall entry). I added new self-tests in mm to make sure it doesn't regress in api behavior. I run those tests before and after my kernel code change to make sure the behavior remains the same, I tested it on 5.15 and 6.1 and 6.4-rc1. Actually, the testing discovered a behavior change for mprotect() between 6.1 and 6.4 (not from this patch, there are refactoring works going on in mm) see this thread [1] I hope those steps will help to mitigate the risk. Agreed on signaling handling is a tough part: what do you think about the approach (modifying PKRU from saved stack after XSAVE), is there a blocker ? > On the x86 side, PKRU is a painfully special snowflake. It's exposed in > the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel. > This would be making it an even more special snowflake because it would I admit I'm quite ignorant on XSAVE to understand the above statement, and how that is related. Could you explain it to me please ? And what is in your mind that might improve the situation ? > need new altstack ABI and handling. > I thought adding protected memory support to signaling handling is an independent project with its own weight. As Jann Horn points out in [2]: "we could prevent the attacker from corrupting the signal context if we can protect the signal stack with a pkey." However, the kernel will send SIGSEGV when the stack is protected by PKEY, so there is a benefit to make this work. (Maybe Jann can share some more thoughts on the benefits) And I believe we could do this in a way with minimum ABI change, as below: - allocate PKEY with a new flag (PKEY_ALTSTACK) - at sigaltstack() call, detect the memory is PKEY_ALTSTACK protected, (similar as what mprotect does in this patch) and save it along with stack address/size. - at signaling handling, use the saved info to fill in PKRU. The ABI change is similar to PKEY_ENFORCE_API, and there is no backward compatibility issue. Will these mentioned help our case ? What do you think ? (Stephan has more info on gains, as far as I know, V8 engineers have worked/thought really hard to come to a suitable solution to make chrome browser safer) [1] https://lore.kernel.org/linux-mm/20230516165754.pocx4kaagn3yyw3r@revolver/T/ [2] https://docs.google.com/document/d/1OlnJbR5TMoaOAJsf4hHOc-FdTmYK2aDUI7d2hfCZSOo/edit?resourcekey=0-v9UJXONYsnG5PlCBbcYqIw# Thanks! Best regards, -Jeff
On 5/18/23 13:20, Jeff Xu wrote:>> Here's my concern about this whole thing: it's headed down a rabbit hole >> which is *highly* specialized both in the apps that will use it and the >> attacks it will mitigate. It probably *requires* turning off a bunch of >> syscalls (like io_uring) that folks kinda like in general. >> > ChromeOS currently disabled io_uring, but it is not required to do so. > io_uring supports the IORING_OP_MADVICE operation, which calls the > do_madvise() function. This means that io_uring will have the same > pkey checks as the madvice() system call. From that perspective, we > will fully support io_uring for this feature. io_uring fundamentally doesn't have the same checks. The kernel side work can be done from an asynchronous kernel thread. That kernel thread doesn't have a meaningful PKRU value. The register has a value, but it's not really related to the userspace threads that are sending it requests. >> We're balancing that highly specialized mitigation with a feature that >> add new ABI, touches core memory management code and signal handling. >> > The ABI change uses the existing flag field in pkey_alloc() which is > reserved. The implementation is backward compatible with all existing > pkey usages in both kernel and user space. Or do you have other > concerns about ABI in mind ? I'm not worried about the past, I'm worried any time we add a new ABI since we need to support it forever. > Yes, you are right about the risk of touching core mm code. To > minimize the risk, I try to control the scope of the change (it is > about 3 lines in mprotect, more in munmap but really just 3 effective > lines from syscall entry). I added new self-tests in mm to make sure > it doesn't regress in api behavior. I run those tests before and after > my kernel code change to make sure the behavior remains the same, I > tested it on 5.15 and 6.1 and 6.4-rc1. Actually, the testing > discovered a behavior change for mprotect() between 6.1 and 6.4 (not > from this patch, there are refactoring works going on in mm) see this > thread [1] > I hope those steps will help to mitigate the risk. > > Agreed on signaling handling is a tough part: what do you think about > the approach (modifying PKRU from saved stack after XSAVE), is there a > blocker ? Yes, signal entry and sigreturn are not necessarily symmetric so you can't really have a stack. >> On the x86 side, PKRU is a painfully special snowflake. It's exposed in >> the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel. >> This would be making it an even more special snowflake because it would > > I admit I'm quite ignorant on XSAVE to understand the above > statement, and how that is related. Could you explain it to me please > ? And what is in your mind that might improve the situation ? In a nutshell: XSAVE components are classified as either user or supervisor. User components can be modified from userspace and supervisor ones only from the kernel. In general, user components don't affect the kernel; the kernel doesn't care what is in ZMM11 (an XSAVE-managed register). That lets us do fun stuff like be lazy about when ZMM11 is saved/restored. Being lazy is good because it give us things like faster context switches and KVM VMEXIT handling. PKRU is a user component, but it affects the kernel when the kernel does copy_to/from_user() and friends. That means that the kernel can't do any "fun stuff" with PKRU. As soon as userspace provides a new value, the kernel needs to start respecting it. That makes PKRU a very special snowflake. So, even though PKRU can be managed by XSAVE, it isn't. It isn't kept in the kernel XSAVE buffer. But it *IS* in the signal stack XSAVE buffer. You *can* save/restore it with the other XSAVE components with ptrace(). The user<->kernel ABI pretends that PKRU is XSAVE managed even though it is not. All of this is special-cased. There's a ton of code to handle this mess. It's _complicated_. I haven't even started talking about how this interacts with KVM and guests. How could we improve it? A time machine would help to either change the architecture or have Linux ignore the fact that XSAVE knows anything about PKRU. So, the bar is pretty high for things that want to further muck with PKRU. Add signal and sigaltstack in particular into the fray, and we've got a recipe for disaster. sigaltstack and XSAVE don't really get along very well. https://lwn.net/Articles/862541/ >> need new altstack ABI and handling. >> > I thought adding protected memory support to signaling handling is an > independent project with its own weight. As Jann Horn points out in > [2]: "we could prevent the attacker from corrupting the signal > context if we can protect the signal stack with a pkey." However, > the kernel will send SIGSEGV when the stack is protected by PKEY, so > there is a benefit to make this work. (Maybe Jann can share some more > thoughts on the benefits) > > And I believe we could do this in a way with minimum ABI change, as below: > - allocate PKEY with a new flag (PKEY_ALTSTACK) > - at sigaltstack() call, detect the memory is PKEY_ALTSTACK protected, > (similar as what mprotect does in this patch) and save it along with > stack address/size. > - at signaling handling, use the saved info to fill in PKRU. > The ABI change is similar to PKEY_ENFORCE_API, and there is no > backward compatibility issue. > > Will these mentioned help our case ? What do you think ? To be honest, no. What you've laid out here is the tip of the complexity iceberg. There are a lot of pieces of the kernel that are not yet factored in. Let's also remember: protection keys is *NOT* a security feature. It's arguable that pkeys is a square peg trying to go into a round security hole.
On Thu, May 18, 2023 at 11:04 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 5/18/23 13:20, Jeff Xu wrote:>> Here's my concern about this whole > thing: it's headed down a rabbit hole > >> which is *highly* specialized both in the apps that will use it and the > >> attacks it will mitigate. It probably *requires* turning off a bunch of > >> syscalls (like io_uring) that folks kinda like in general. > >> > > ChromeOS currently disabled io_uring, but it is not required to do so. > > io_uring supports the IORING_OP_MADVICE operation, which calls the > > do_madvise() function. This means that io_uring will have the same > > pkey checks as the madvice() system call. From that perspective, we > > will fully support io_uring for this feature. > > io_uring fundamentally doesn't have the same checks. The kernel side > work can be done from an asynchronous kernel thread. That kernel thread > doesn't have a meaningful PKRU value. The register has a value, but > it's not really related to the userspace threads that are sending it > requests. > > >> We're balancing that highly specialized mitigation with a feature that > >> add new ABI, touches core memory management code and signal handling. > >> > > The ABI change uses the existing flag field in pkey_alloc() which is > > reserved. The implementation is backward compatible with all existing > > pkey usages in both kernel and user space. Or do you have other > > concerns about ABI in mind ? > > I'm not worried about the past, I'm worried any time we add a new ABI > since we need to support it forever. > > > Yes, you are right about the risk of touching core mm code. To > > minimize the risk, I try to control the scope of the change (it is > > about 3 lines in mprotect, more in munmap but really just 3 effective > > lines from syscall entry). I added new self-tests in mm to make sure > > it doesn't regress in api behavior. I run those tests before and after > > my kernel code change to make sure the behavior remains the same, I > > tested it on 5.15 and 6.1 and 6.4-rc1. Actually, the testing > > discovered a behavior change for mprotect() between 6.1 and 6.4 (not > > from this patch, there are refactoring works going on in mm) see this > > thread [1] > > I hope those steps will help to mitigate the risk. > > > > Agreed on signaling handling is a tough part: what do you think about > > the approach (modifying PKRU from saved stack after XSAVE), is there a > > blocker ? > > Yes, signal entry and sigreturn are not necessarily symmetric so you > can't really have a stack. > > >> On the x86 side, PKRU is a painfully special snowflake. It's exposed in > >> the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel. > >> This would be making it an even more special snowflake because it would > > > > I admit I'm quite ignorant on XSAVE to understand the above > > statement, and how that is related. Could you explain it to me please > > ? And what is in your mind that might improve the situation ? > > In a nutshell: XSAVE components are classified as either user or > supervisor. User components can be modified from userspace and > supervisor ones only from the kernel. In general, user components don't > affect the kernel; the kernel doesn't care what is in ZMM11 (an > XSAVE-managed register). That lets us do fun stuff like be lazy about > when ZMM11 is saved/restored. Being lazy is good because it give us > things like faster context switches and KVM VMEXIT handling. > > PKRU is a user component, but it affects the kernel when the kernel does > copy_to/from_user() and friends. That means that the kernel can't do > any "fun stuff" with PKRU. As soon as userspace provides a new value, > the kernel needs to start respecting it. That makes PKRU a very special > snowflake. > > So, even though PKRU can be managed by XSAVE, it isn't. It isn't kept > in the kernel XSAVE buffer. But it *IS* in the signal stack XSAVE > buffer. You *can* save/restore it with the other XSAVE components with > ptrace(). The user<->kernel ABI pretends that PKRU is XSAVE managed > even though it is not. > > All of this is special-cased. There's a ton of code to handle this > mess. It's _complicated_. I haven't even started talking about how > this interacts with KVM and guests. > > How could we improve it? A time machine would help to either change the > architecture or have Linux ignore the fact that XSAVE knows anything > about PKRU. > > So, the bar is pretty high for things that want to further muck with > PKRU. Add signal and sigaltstack in particular into the fray, and we've > got a recipe for disaster. sigaltstack and XSAVE don't really get along > very well. https://lwn.net/Articles/862541/ > > >> need new altstack ABI and handling. > >> > > I thought adding protected memory support to signaling handling is an > > independent project with its own weight. As Jann Horn points out in > > [2]: "we could prevent the attacker from corrupting the signal > > context if we can protect the signal stack with a pkey." However, > > the kernel will send SIGSEGV when the stack is protected by PKEY, so > > there is a benefit to make this work. (Maybe Jann can share some more > > thoughts on the benefits) > > > > And I believe we could do this in a way with minimum ABI change, as below: > > - allocate PKEY with a new flag (PKEY_ALTSTACK) > > - at sigaltstack() call, detect the memory is PKEY_ALTSTACK protected, > > (similar as what mprotect does in this patch) and save it along with > > stack address/size. > > - at signaling handling, use the saved info to fill in PKRU. > > The ABI change is similar to PKEY_ENFORCE_API, and there is no > > backward compatibility issue. > > > > Will these mentioned help our case ? What do you think ? > > To be honest, no. > > What you've laid out here is the tip of the complexity iceberg. There > are a lot of pieces of the kernel that are not yet factored in. > > Let's also remember: protection keys is *NOT* a security feature. It's > arguable that pkeys is a square peg trying to go into a round security hole. While they're not a security feature, they're pretty close to providing us with exactly what we need: per-thread memory permissions that we can use for in-process isolation. We've spent quite some effort up front thinking about potential attacks and we're confident we can build something that will pose a meaningful boundary. > On the x86 side, PKRU is a painfully special snowflake. It's exposed in > the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel. > This would be making it an even more special snowflake because it would > need new altstack ABI and handling. Most of the complexity in the signal handling proposal seems to come from the saving/restoring pkru before/after the signal handler execution. However, this is just nice to have. We just need the kernel to allow us to register pkey-tagged memory as a sigaltstack, i.e. it shouldn't crash when trying to write the register state to the stack. Everything else, we can do in userland. > It probably *requires* turning off a bunch of > syscalls (like io_uring) that folks kinda like in general. Kind of. This approach only works in combination with an effort in userland to restrict the syscalls. Though that doesn't mean you have to turn them off, there's also the option of adding validation before it. The same applies to the memory management syscalls in this patchset. We can add validation for these in userland, but we're hoping to do it in kernel instead for the reasons I mentioned before (e.g. they're very common and it's much easier to validate in the kernel). Also subjectively it seems like a nice property if the pkey protections would not just apply to the memory contents, but also apply to the metadata.
Thanks for bringing this to my attention. Regarding io_uring: > > io_uring fundamentally doesn't have the same checks. The kernel side > work can be done from an asynchronous kernel thread. That kernel thread > doesn't have a meaningful PKRU value. The register has a value, but > it's not really related to the userspace threads that are sending it > requests. > I asked the question to the io_uring list [1]. io_uring thread will respect PKRU of the user thread, async or not, the behavior is the same as regular syscall. There will be no issue for io_uring, i.e if it decides to add more memory mapping syscalls to supported cmd in future. [1] https://lore.kernel.org/io-uring/CABi2SkUp45HEt7eQ6a47Z7b3LzW=4m3xAakG35os7puCO2dkng@mail.gmail.com/ Thanks. -Jeff
Hi Dave, Regarding siglongjmp: On Thu, May 18, 2023 at 8:37 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 5/17/23 16:48, Jeff Xu wrote: > > However, there are a few challenges I have not yet worked through. > > First, the code needs to track when the first signaling entry occurs > > (saving the PKRU register to the thread struct) and when it is last > > returned (restoring the PKRU register from the thread struct). > > Would tracking signal "depth" work in the face of things like siglongjmp? > siglongjmp is interesting, thanks for bringing this up. With siglongjmp, the thread doesn't go back to the place where signal is raised, indeed, this idea of tracking the first signaling entry doesn't work well with siglongjmp. Thanks for your insight! -Jeff -Jeff
Hi Dave, Thanks for feedback, regarding sigaltstack: On Thu, May 18, 2023 at 2:04 PM Dave Hansen <dave.hansen@intel.com> wrote: > > > > Agreed on signaling handling is a tough part: what do you think about > > the approach (modifying PKRU from saved stack after XSAVE), is there a > > blocker ? > > Yes, signal entry and sigreturn are not necessarily symmetric so you > can't really have a stack. > To clarify: I mean this option below: - before get_sigframe(), save PKUR => tmp - modify thread's PKRU so it can write to sigframe - XSAVE - save tmp => sigframe I believe you proposed this in a previous discussion [1]: and I quote here: "There's a delicate point when building the stack frame that the kernel would need to move over to the new PKRU value to build the frame before it writes the *OLD* value to the frame. But, it's far from impossible." sigreturn will restore thread's original PKRU from sigframe. In case of asymmetrics caused by siglongjmp, user space doesn't call sigreturn, the application needs to set desired PKRU before siglongjmp. I think this solution should work. [1] https://lore.kernel.org/lkml/b4f0dca5-1d15-67f7-4600-9a0a91e9d0bd@intel.com/ Best regards, -Jeff
On 5/31/23 18:39, Jeff Xu wrote:
> I think this solution should work.
By "work" I think you mean that if laser-focused on this one use case,
without a full implementation, it looks like it can work.
I'll give you a "maybe" on that.
But that leaves out the bigger picture. How many other things will we
regress doing this? What's the opportunity cost? What other things
will get neglected because we did _this_ one? Are there more users out
there?
Looking at the big picture, I'm not convinced those tradeoffs are good
ones (and you're not going to find anyone that's a bigger fan of pkeys
than me).
From: Jeff Xu <jeffxu@google.com> This is the first set of Memory mapping (VMA) protection patches using PKU. * * * Background: As discussed previously in the kernel mailing list [1], V8 CFI [2] uses PKU to protect memory, and Stephen Röttger proposes to extend the PKU to memory mapping [3]. We're using PKU for in-process isolation to enforce control-flow integrity for a JIT compiler. In our threat model, an attacker exploits a vulnerability and has arbitrary read/write access to the whole process space concurrently to other threads being executed. This attacker can manipulate some arguments to syscalls from some threads. Under such a powerful attack, we want to create a “safe/isolated” thread environment. We assign dedicated PKUs to this thread, and use those PKUs to protect the threads’ runtime environment. The thread has exclusive access to its run-time memory. This includes modifying the protection of the memory mapping, or munmap the memory mapping after use. And the other threads won’t be able to access the memory or modify the memory mapping (VMA) belonging to the thread. * * * Proposed changes: This patch introduces a new flag, PKEY_ENFORCE_API, to the pkey_alloc() function. When a PKEY is created with this flag, it is enforced that any thread that wants to make changes to the memory mapping (such as mprotect) of the memory must have write access to the PKEY. PKEYs created without this flag will continue to work as they do now, for backwards compatibility. Only PKEY created from user space can have the new flag set, the PKEY allocated by the kernel internally will not have it. In other words, ARCH_DEFAULT_PKEY(0) and execute_only_pkey won’t have this flag set, and continue work as today. This flag is checked only at syscall entry, such as mprotect/munmap in this set of patches. It will not apply to other call paths. In other words, if the kernel want to change attributes of VMA for some reasons, the kernel is free to do that and not affected by this new flag. This set of patch covers mprotect/munmap, I plan to work on other syscalls after this. * * * Testing: I have tested this patch on a Linux kernel 5.15, 6,1, and 6.4-rc1, new selftest is added in: pkey_enforce_api.c * * * Discussion: We believe that this patch provides a valuable security feature. It allows us to create “safe/isolated” thread environments that are protected from attackers with arbitrary read/write access to the process space. We believe that the interface change and the patch don't introduce backwards compatibility risk. We would like to disucss this patch in Linux kernel community for feedback and support. * * * Reference: [1]https://lore.kernel.org/all/202208221331.71C50A6F@keescook/ [2]https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit?usp=sharing [3]https://docs.google.com/document/d/1qqVoVfRiF2nRylL3yjZyCQvzQaej1HRPh3f5wj1AS9I/edit Best Regards, -Jeff Xu Jeff Xu (6): PKEY: Introduce PKEY_ENFORCE_API flag PKEY: Add arch_check_pkey_enforce_api() PKEY: Apply PKEY_ENFORCE_API to mprotect PKEY:selftest pkey_enforce_api for mprotect KEY: Apply PKEY_ENFORCE_API to munmap PKEY:selftest pkey_enforce_api for munmap arch/powerpc/include/asm/pkeys.h | 19 +- arch/x86/include/asm/mmu.h | 7 + arch/x86/include/asm/pkeys.h | 92 +- arch/x86/mm/pkeys.c | 2 +- include/linux/mm.h | 2 +- include/linux/pkeys.h | 18 +- include/uapi/linux/mman.h | 5 + mm/mmap.c | 34 +- mm/mprotect.c | 31 +- mm/mremap.c | 6 +- tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/pkey_enforce_api.c | 1312 +++++++++++++++++ 12 files changed, 1507 insertions(+), 22 deletions(-) create mode 100644 tools/testing/selftests/mm/pkey_enforce_api.c base-commit: ba0ad6ed89fd5dada3b7b65ef2b08e95d449d4ab