| Message ID | 20240311164638.2015063-12-pasha.tatashin@soleen.com (mailing list archive) |
|---|---|
| State | New |
| Series | Dynamic Kernel Stacks |
On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote: > Add dynamic_stack_fault() calls to the kernel faults, and also declare > HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be > enabled on x86 architecture. > > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> > --- > arch/x86/Kconfig | 1 + > arch/x86/kernel/traps.c | 3 +++ > arch/x86/mm/fault.c | 3 +++ > 3 files changed, 7 insertions(+) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 5edec175b9bf..9bb0da3110fa 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -197,6 +197,7 @@ config X86 > select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD > select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD > select HAVE_ARCH_VMAP_STACK if X86_64 > + select HAVE_ARCH_DYNAMIC_STACK if X86_64 > select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET > select HAVE_ARCH_WITHIN_STACK_FRAMES > select HAVE_ASM_MODVERSIONS > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c > index c3b2f863acf0..cc05401e729f 100644 > --- a/arch/x86/kernel/traps.c > +++ b/arch/x86/kernel/traps.c > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault) > } > #endif > > + if (dynamic_stack_fault(current, address)) > + return; > + Sorry, but no, you can't necessarily do this. I say this as the person who wrote this code, and I justified my code on the basis that we are not recovering -- we're jumping out to a different context, and we won't crash even if the origin context for the fault is corrupt. The SDM is really quite unambiguous about it: we're in an "abort" context, and returning is not allowed. And I think this may well be the real deal -- the microcode does not promise to have the return frame and the actual faulting context matched up here, and there is no architectural guarantee that returning will do the right thing. Now we do have some history of getting a special exception, e.g. for espfix64. But espfix64 is a very special case, and the situation you're looking at is very general. So unless Intel and AMD are both willing to publicly document that it's okay to handle stack overflow, where any instruction in the ISA may have caused the overflow, like this, then we're not going to do it. There are some other options: you could pre-map Also, I think the whole memory allocation concept in this whole series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator. Panicking due to kernel stack allocation would be very unpleasant. But perhaps we could have a rule that a task can only be scheduled in if there is sufficient memory available for its stack. And perhaps we could avoid ever page-faulting by filling in the PTEs for the potential stack pages but leaving them un-accessed. I *think* that all x86 implementations won't fill the TLB for a non-accessed page without also setting the accessed bit, so the performance hit of filling the PTEs, running the task, and then doing the appropriate synchronization to clear the PTEs and read the accessed bit on schedule-out to release the pages may not be too bad. But you would need to do this cautiously in the scheduler, possibly in the *next* task but before the prev task is actually released enough to be run on a different CPU. It's going to be messy.
On Mon, Mar 11, 2024 at 6:17 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote: > > Add dynamic_stack_fault() calls to the kernel faults, and also declare > > HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be > > enabled on x86 architecture. > > > > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> > > --- > > arch/x86/Kconfig | 1 + > > arch/x86/kernel/traps.c | 3 +++ > > arch/x86/mm/fault.c | 3 +++ > > 3 files changed, 7 insertions(+) > > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > > index 5edec175b9bf..9bb0da3110fa 100644 > > --- a/arch/x86/Kconfig > > +++ b/arch/x86/Kconfig > > @@ -197,6 +197,7 @@ config X86 > > select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD > > select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD > > select HAVE_ARCH_VMAP_STACK if X86_64 > > + select HAVE_ARCH_DYNAMIC_STACK if X86_64 > > select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET > > select HAVE_ARCH_WITHIN_STACK_FRAMES > > select HAVE_ASM_MODVERSIONS > > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c > > index c3b2f863acf0..cc05401e729f 100644 > > --- a/arch/x86/kernel/traps.c > > +++ b/arch/x86/kernel/traps.c > > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault) > > } > > #endif > > > > + if (dynamic_stack_fault(current, address)) > > + return; > > + > > Sorry, but no, you can't necessarily do this. I say this as the person who write this code, and I justified my code on the basis that we are not recovering -- we're jumping out to a different context, and we won't crash if the origin context for the fault is corrupt. The SDM is really quite unambiguous about it: we're in an "abort" context, and returning is not allowed. And I this may well be is the real deal -- the microcode does not promise to have the return frame and the actual faulting context matched up here, and there's is no architectural guarantee that returning will do the right thing. > > Now we do have some history of getting a special exception, e.g. for espfix64. But espfix64 is a very special case, and the situation you're looking at is very general. So unless Intel and AMD are both wiling to publicly document that it's okay to handle stack overflow, where any instruction in the ISA may have caused the overflow, like this, then we're not going to do it. Hi Andy, Thank you for the insightful feedback. I'm somewhat confused about why we end up in exc_double_fault() in the first place. My initial assumption was that dynamic_stack_fault() would only be needed within do_kern_addr_fault(). However, while testing in QEMU, I found that when using memset() on a stack variable, code like this: rep stos %rax,%es:(%rdi) causes a double fault instead of a regular fault. I added it to exc_double_fault() as a result, but I'm curious if you have any insights into why this behavior occurs. > There are some other options: you could pre-map Pre-mapping would be expensive. It would mean pre-mapping the dynamic pages for every scheduled thread, and we'd still need to check the access bit every time a thread leaves the CPU. Dynamic thread faults should be considered rare events and thus shouldn't significantly affect the performance of normal context switch operations. With 8K stacks, we might encounter only 0.00001% of stacks requiring an extra page, and even fewer needing 16K. > Also, I think the whole memory allocation concept in this whole series is a bit odd. 
Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator. Panicing due to kernel stack allocation would be very unpleasant. We never block during handling stack faults. There's a per-CPU page pool, guaranteeing availability for the faulting thread. The thread simply takes pages from this per-CPU data structure and refills the pool when leaving the CPU. The faulting routine is efficient, requiring a fixed number of loads without any locks, stalling, or even cmpxchg operations. > But perhaps we could have a rule that a task can only be scheduled in if there is sufficient memory available for its stack. Yes, I've considered this as well. We might implement this to avoid crashes due to page faults. Basically, if the per-CPU pool cannot be refilled, we'd prevent task scheduling until it is. At that point we'd already be so short on memory that the kernel can't allocate even 3 pages of memory. Thank you, Pasha > And perhaps we could avoid every page-faulting by filling in the PTEs for the potential stack pages but leaving them un-accessed. I *think* that all x86 implementations won't fill the TLB for a non-accessed page without also setting the accessed bit, so the performance hit of filling the PTEs, running the task, and then doing the appropriate synchronization to clear the PTEs and read the accessed bit on schedule-out to release the pages may not be too bad. But you would need to do this cautiously in the scheduler, possibly in the *next* task but before the prev task is actually released enough to be run on a different CPU. It's going to be messy.
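As an illustration only, a minimal sketch of the kind of per-CPU pool Pasha describes might look like the following; the names dynamic_stack_pool, DYNAMIC_STACK_POOL_PAGES and the two helpers are invented here, not taken from the series:

#include <linux/gfp.h>
#include <linux/mm_types.h>
#include <linux/percpu.h>

/*
 * Rough sketch: a small per-CPU pool of pre-allocated pages so the
 * stack-fault path never calls into the page allocator. Assumes the
 * fault path runs with preemption disabled on this CPU.
 */
#define DYNAMIC_STACK_POOL_PAGES	3

struct dynamic_stack_pool {
	struct page *pages[DYNAMIC_STACK_POOL_PAGES];
	unsigned int nr;
};

static DEFINE_PER_CPU(struct dynamic_stack_pool, dynamic_stack_pool);

/* Fault path: a fixed number of loads/stores, no locks, no allocation. */
static struct page *dynamic_stack_take_page(void)
{
	struct dynamic_stack_pool *pool = this_cpu_ptr(&dynamic_stack_pool);

	if (!pool->nr)
		return NULL;		/* pool exhausted */
	return pool->pages[--pool->nr];
}

/* Schedule-out path: allowed to allocate in order to top the pool back up. */
static void dynamic_stack_refill_pool(void)
{
	struct dynamic_stack_pool *pool = this_cpu_ptr(&dynamic_stack_pool);

	while (pool->nr < DYNAMIC_STACK_POOL_PAGES) {
		struct page *page = alloc_page(GFP_KERNEL);

		if (!page)
			break;		/* the "what if refill fails?" case below */
		pool->pages[pool->nr++] = page;
	}
}

The split is the point: the take path only touches pre-filled entries, and anything that may sleep is confined to the refill path, which is exactly where the "what if you can't refill?" question below applies.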
On Mon, Mar 11 2024 at 19:10, Pasha Tatashin wrote: > On Mon, Mar 11, 2024 at 6:17 PM Andy Lutomirski <luto@kernel.org> wrote: >> Also, I think the whole memory allocation concept in this whole >> series is a bit odd. Fundamentally, we *can't* block on these stack >> faults -- we may be in a context where blocking will deadlock. We >> may be in the page allocator. Panicing due to kernel stack >> allocation would be very unpleasant. > > We never block during handling stack faults. There's a per-CPU page > pool, guaranteeing availability for the faulting thread. The thread > simply takes pages from this per-CPU data structure and refills the > pool when leaving the CPU. The faulting routine is efficient, > requiring a fixed number of loads without any locks, stalling, or even > cmpxchg operations. Is this true for any context including nested exceptions and #NMI? Thanks, tglx
On Mon, Mar 11, 2024, at 4:10 PM, Pasha Tatashin wrote: > On Mon, Mar 11, 2024 at 6:17 PM Andy Lutomirski <luto@kernel.org> wrote: >> >> >> >> On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote: >> > Add dynamic_stack_fault() calls to the kernel faults, and also declare >> > HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be >> > enabled on x86 architecture. >> > >> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> >> > --- >> > arch/x86/Kconfig | 1 + >> > arch/x86/kernel/traps.c | 3 +++ >> > arch/x86/mm/fault.c | 3 +++ >> > 3 files changed, 7 insertions(+) >> > >> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> > index 5edec175b9bf..9bb0da3110fa 100644 >> > --- a/arch/x86/Kconfig >> > +++ b/arch/x86/Kconfig >> > @@ -197,6 +197,7 @@ config X86 >> > select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD >> > select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD >> > select HAVE_ARCH_VMAP_STACK if X86_64 >> > + select HAVE_ARCH_DYNAMIC_STACK if X86_64 >> > select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET >> > select HAVE_ARCH_WITHIN_STACK_FRAMES >> > select HAVE_ASM_MODVERSIONS >> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c >> > index c3b2f863acf0..cc05401e729f 100644 >> > --- a/arch/x86/kernel/traps.c >> > +++ b/arch/x86/kernel/traps.c >> > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault) >> > } >> > #endif >> > >> > + if (dynamic_stack_fault(current, address)) >> > + return; >> > + >> >> Sorry, but no, you can't necessarily do this. I say this as the person who write this code, and I justified my code on the basis that we are not recovering -- we're jumping out to a different context, and we won't crash if the origin context for the fault is corrupt. The SDM is really quite unambiguous about it: we're in an "abort" context, and returning is not allowed. And I this may well be is the real deal -- the microcode does not promise to have the return frame and the actual faulting context matched up here, and there's is no architectural guarantee that returning will do the right thing. >> >> Now we do have some history of getting a special exception, e.g. for espfix64. But espfix64 is a very special case, and the situation you're looking at is very general. So unless Intel and AMD are both wiling to publicly document that it's okay to handle stack overflow, where any instruction in the ISA may have caused the overflow, like this, then we're not going to do it. > > Hi Andy, > > Thank you for the insightful feedback. > > I'm somewhat confused about why we end up in exc_double_fault() in the > first place. My initial assumption was that dynamic_stack_fault() > would only be needed within do_kern_addr_fault(). However, while > testing in QEMU, I found that when using memset() on a stack variable, > code like this: > > rep stos %rax,%es:(%rdi) > > causes a double fault instead of a regular fault. I added it to > exc_double_fault() as a result, but I'm curious if you have any > insights into why this behavior occurs. > Imagine you're a CPU running kernel code, on a fairly traditional architecture like x86. The code tries to access some swapped out user memory. You say "sorry, that memory is not present" and generate a page fault. You save the current state *to the stack* and change the program counter to point to the page fault handler. The page fault handler does its thing, then pops the old state off the stack and resumes the faulting code.
A few microseconds later, the kernel fills up its stack and then does: PUSH something but that would write to a not-present stack page, because you already filled the stack. Okay, a page fault -- no big deal, we know how to handle that. So you push the current state to the stack. Oh wait, you *can't* push the current state to the stack, because that would involve writing to an unmapped page of memory. So you trigger a double-fault. You push some state to the double-fault handler's special emergency stack. But wait, *what* state do you push? Is it the state that did the "PUSH something" and overflowed the stack? Or is it some virtual state that's a mixture of that and the failed page fault handler? What if the stack wasn't quite full and you actually succeeded in pushing the old stack pointer but not the old program counter? What saved state goes where? This is a complicated mess, so the people who designed all this said 'hey, wait a minute, let's not call double faults a "fault" -- let's call them an "abort"' so we can stop confusing ourselves and ship CPUs to customers. And "abort" means "the saved state is not well defined -- don't rely on it having any particular meaning". So, until a few years ago, we would just print something like "PANIC: double fault" and kill the whole system. A few years ago, I decided this was lame, and I wanted to have stack guard pages, so I added real fancy new logic: instead, we do our best to display the old state, but it's a guess and all we're doing with it is printk -- if it's wrong, it's annoying, but that's all. And then we kill the running thread -- instead of trying to return (and violating our sacred contract with the x86 architecture), we *reset* the current crashing thread's state to a known-good state. Then we return to *that* state. Now we're off the emergency stack and we're running something resembling normal kernel code, but we can't return, as there is nowhere to return to. But that's fine -- instead we kill the current thread, kind of like _exit(). That never returns, so it's okay that we can't return. But your patch adds a return statement to this whole mess, which will return to the moderately-likely-to-be-corrupt state that caused a double fault inside the microcode for the page fault path, and you have stepped outside the well-defined path in the x86 architecture, and you've triggered something akin to Undefined Behavior. The CPU won't catch fire, but it reserves the right to execute from an incorrect RSP and/or RIP, to be in the middle of an instruction, etc. (For that matter, what if there was exactly enough room to enter the page fault handler, but the very first instruction of the page fault handler overflowed the stack? Then you allocate more memory, get lucky and successfully resume the page fault handler, and then promptly OOPS because you run the page fault handler and it thinks you got a kernel page fault? My OOPS code handles that, but, again, it's not trying to recover.) >> There are some other options: you could pre-map > > Pre-mapping would be expensive. It would mean pre-mapping the dynamic > pages for every scheduled thread, and we'd still need to check the > access bit every time a thread leaves the CPU. That's a write to four consecutive words in memory, with no locking required. > Dynamic thread faults > should be considered rare events and thus shouldn't significantly > affect the performance of normal context switch operations.
With 8K > stacks, we might encounter only 0.00001% of stacks requiring an extra > page, and even fewer needing 16K. Well yes, but if you crash 0.0001% of the time due to the microcode not liking you, you lose. :) > >> Also, I think the whole memory allocation concept in this whole series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator. Panicing due to kernel stack allocation would be very unpleasant. > > We never block during handling stack faults. There's a per-CPU page > pool, guaranteeing availability for the faulting thread. The thread > simply takes pages from this per-CPU data structure and refills the > pool when leaving the CPU. The faulting routine is efficient, > requiring a fixed number of loads without any locks, stalling, or even > cmpxchg operations. You can't block when scheduling, either. What if you can't refill the pool?
On 3/11/24 15:17, Andy Lutomirski wrote: > I *think* that all x86 implementations won't fill the TLB for a > non-accessed page without also setting the accessed bit, That's my understanding as well. The SDM is a little more obtuse about it: > Whenever the processor uses a paging-structure entry as part of > linear-address translation, it sets the accessed flag in that entry > (if it is not already set). but it's there. But if we start needing Accessed=1 to be accurate, clearing those PTEs gets more expensive because it needs to be atomic to lock out the page walker. It basically needs to start getting treated similarly to what is done for Dirty=1 on userspace PTEs. Not the end of the world, of course, but one more source of overhead.
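To make that atomicity point concrete, here is a hedged sketch of reclaiming an on-demand stack page only if the hardware never marked it accessed, where the PTE is torn down with an atomic exchange so a concurrent Accessed-bit update by the page walker cannot be lost. The helper name is invented; whether this is sufficient, or whether the fuller Dirty=1-style treatment Dave mentions is needed, is exactly the open question here:

#include <linux/pgtable.h>
#include <asm/tlbflush.h>

/*
 * Sketch only: reclaim a stack page iff Accessed=1 was never set.
 * ptep_get_and_clear() is atomic, so a page walk that set the A bit
 * concurrently is caught by re-checking the returned value.
 */
static bool dynamic_stack_try_reclaim(unsigned long addr, pte_t *ptep)
{
	pte_t old = ptep_get(ptep);

	if (pte_none(old) || pte_young(old))
		return false;		/* nothing mapped, or the page was used */

	old = ptep_get_and_clear(&init_mm, addr, ptep);
	if (pte_young(old)) {
		/* Lost the race: the walker marked it accessed; keep the page. */
		set_pte_at(&init_mm, addr, ptep, old);
		return false;
	}

	/* Any clean TLB entry must still be shot down before reuse. */
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
	return true;
}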
On Mon, Mar 11, 2024, at 4:34 PM, Dave Hansen wrote: > On 3/11/24 15:17, Andy Lutomirski wrote: >> I *think* that all x86 implementations won't fill the TLB for a >> non-accessed page without also setting the accessed bit, > > That's my understanding as well. The SDM is a little more obtuse about it: > >> Whenever the processor uses a paging-structure entry as part of >> linear-address translation, it sets the accessed flag in that entry >> (if it is not already set). > > but it's there. > > But if we start needing Accessed=1 to be accurate, clearing those PTEs > gets more expensive because it needs to be atomic to lock out the page > walker. It basically needs to start getting treated similarly to what > is done for Dirty=1 on userspace PTEs. Not the end of the world, of > course, but one more source of overhead. In my fantasy land where I understand the x86 paging machinery, suppose we're in finish_task_switch(), and suppose prev is Not Horribly Buggy (TM). In particular, suppose that no other CPU is concurrently (non-speculatively!) accessing prev's stack. Prev can't be running, because whatever magic lock prevents it from being migrated hasn't been released yet. (I have no idea what lock this is, but it had darned well better exist so prev isn't migrated before switch_to() even returns.) So the current CPU is not accessing the memory, and no other CPU is accessing the memory, and BPF doesn't exist, so no one is being utterly daft and using a kernel read probe, and perf isn't up to any funny business, etc. And a CPU will never *speculatively* set the accessed bit (I told you it's fantasy land), so we just do it unlocked:

  if (!pte->accessed) {
      *pte = 0;
      reuse the memory;
  }

What could possibly go wrong? I admit this is not the best idea I've ever had, and I will not waste anyone's time by trying very hard to defend it :)
> On 12 Mar 2024, at 1:41, Andy Lutomirski <luto@kernel.org> wrote: > > On Mon, Mar 11, 2024, at 4:34 PM, Dave Hansen wrote: >> On 3/11/24 15:17, Andy Lutomirski wrote: >>> I *think* that all x86 implementations won't fill the TLB for a >>> non-accessed page without also setting the accessed bit, >> >> That's my understanding as well. The SDM is a little more obtuse about it: >> >>> Whenever the processor uses a paging-structure entry as part of >>> linear-address translation, it sets the accessed flag in that entry >>> (if it is not already set). >> >> but it's there. >> >> But if we start needing Accessed=1 to be accurate, clearing those PTEs >> gets more expensive because it needs to be atomic to lock out the page >> walker. It basically needs to start getting treated similarly to what >> is done for Dirty=1 on userspace PTEs. Not the end of the world, of >> course, but one more source of overhead. > > In my fantasy land where I understand the x86 paging machinery, suppose we're in finish_task_switch(), and suppose prev is Not Horribly Buggy (TM). In particular, suppose that no other CPU is concurrently (non-speculatively!) accessing prev's stack. Prev can't be running, because whatever magic lock prevents it from being migrated hasn't been released yet. (I have no idea what lock this is, but it had darned well better exist so prev isn't migrated before switch_to() even returns.) > > So the current CPU is not accessing the memory, and no other CPU is accessing the memory, and BPF doesn't exist, so no one is being utterly daft and a kernel read probe, and perf isn't up to any funny business, etc. And a CPU will never *speculatively* set the accessed bit (I told you it's fantasy land), so we just do it unlocked: > > if (!pte->accessed) { > *pte = 0; > reuse the memory; > } > > What could possibly go wrong? > > I admit this is not the best idea I've ever had, and I will not waste anyone's time by trying very hard to defend it :) > Just a thought: you don’t care if someone only reads from the stack's page (you can just install another page later). IOW: you only care if someone writes. So you can look on the dirty-bit, which is not being set speculatively and save yourself one problem.
On Mon, Mar 11, 2024, at 4:56 PM, Nadav Amit wrote: >> On 12 Mar 2024, at 1:41, Andy Lutomirski <luto@kernel.org> wrote: >> >> On Mon, Mar 11, 2024, at 4:34 PM, Dave Hansen wrote: >>> On 3/11/24 15:17, Andy Lutomirski wrote: >>>> I *think* that all x86 implementations won't fill the TLB for a >>>> non-accessed page without also setting the accessed bit, >>> >>> That's my understanding as well. The SDM is a little more obtuse about it: >>> >>>> Whenever the processor uses a paging-structure entry as part of >>>> linear-address translation, it sets the accessed flag in that entry >>>> (if it is not already set). >>> >>> but it's there. >>> >>> But if we start needing Accessed=1 to be accurate, clearing those PTEs >>> gets more expensive because it needs to be atomic to lock out the page >>> walker. It basically needs to start getting treated similarly to what >>> is done for Dirty=1 on userspace PTEs. Not the end of the world, of >>> course, but one more source of overhead. >> >> In my fantasy land where I understand the x86 paging machinery, suppose we're in finish_task_switch(), and suppose prev is Not Horribly Buggy (TM). In particular, suppose that no other CPU is concurrently (non-speculatively!) accessing prev's stack. Prev can't be running, because whatever magic lock prevents it from being migrated hasn't been released yet. (I have no idea what lock this is, but it had darned well better exist so prev isn't migrated before switch_to() even returns.) >> >> So the current CPU is not accessing the memory, and no other CPU is accessing the memory, and BPF doesn't exist, so no one is being utterly daft and a kernel read probe, and perf isn't up to any funny business, etc. And a CPU will never *speculatively* set the accessed bit (I told you it's fantasy land), so we just do it unlocked: >> >> if (!pte->accessed) { >> *pte = 0; >> reuse the memory; >> } >> >> What could possibly go wrong? >> >> I admit this is not the best idea I've ever had, and I will not waste anyone's time by trying very hard to defend it :) >> > > Just a thought: you don’t care if someone only reads from the stack's > page (you can just install another page later). IOW: you only care if > someone writes. > > So you can look on the dirty-bit, which is not being set speculatively > and save yourself one problem. Doesn't this buy a new problem? Install a page, run the thread without using the page but speculatively load the PTE as read-only into the TLB, context-switch out the thread, (entirely safely and correctly) determine that the page wasn't used, remove it from the PTE, use it for something else and fill it with things that aren't zero, run the thread again, and read from it. Now it has some other thread's data! One might slightly credibly argue that this isn't a problem -- between RSP and the bottom of the area that one nominally considers to be the stack is allowed to return arbitrary garbage, especially in the kernel where there's no red zone (until someone builds a kernel with a redzone on a FRED system, hmm), but this is still really weird. If you *write* in that area, the CPU hopefully puts the *correct* value in the TLB and life goes on, but how much do you trust anyone to have validated what happens when a PTE is present, writable and clean but the TLB contains a stale entry pointing somewhere else? And is it really okay to do this to the poor kernel?
If we're going to add a TLB flush on context switch, then (a) we are being rather silly and (b) we might as well just use atomics to play with the accessed bit instead, I think.
> >> There are some other options: you could pre-map > > > > Pre-mapping would be expensive. It would mean pre-mapping the dynamic > > pages for every scheduled thread, and we'd still need to check the > > access bit every time a thread leaves the CPU. > > That's a write to four consecutive words in memory, with no locking required. You convinced me; this might not be that bad. At thread creation time we will save the locations of the unmapped thread PTEs, and set them on every schedule. There is a slight increase in scheduling cost, but perhaps it is not as bad as I initially thought. This approach, however, makes this dynamic stack feature much safer, and can be easily extended to all arches that support access/dirty bit tracking. > > > Dynamic thread faults > > should be considered rare events and thus shouldn't significantly > > affect the performance of normal context switch operations. With 8K > > stacks, we might encounter only 0.00001% of stacks requiring an extra > > page, and even fewer needing 16K. > > Well yes, but if you crash 0.0001% of the time due to the microcode not liking you, you lose. :) > > > > >> Also, I think the whole memory allocation concept in this whole series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator. Panicing due to kernel stack allocation would be very unpleasant. > > > > We never block during handling stack faults. There's a per-CPU page > > pool, guaranteeing availability for the faulting thread. The thread > > simply takes pages from this per-CPU data structure and refills the > > pool when leaving the CPU. The faulting routine is efficient, > > requiring a fixed number of loads without any locks, stalling, or even > > cmpxchg operations. > > You can't block when scheduling, either. What if you can't refill the pool? Why can't we (I am not a scheduler guy)? IRQs are not yet disabled, what prevents us from blocking while the old process has not yet been removed from the CPU?
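A sketch of what that schedule-in/schedule-out flow could look like, ignoring for brevity the TLB and atomicity questions raised above; NR_DYNAMIC_STACK_PAGES and the stack_pages/stack_ptes/stack_va fields on task_struct are hypothetical additions, not existing fields:

#include <linux/pgtable.h>
#include <linux/sched.h>

#define NR_DYNAMIC_STACK_PAGES	3

/* Pre-map the extra stack pages with Accessed=0 when a task goes on CPU. */
static void dynamic_stack_map_on_sched_in(struct task_struct *tsk)
{
	int i;

	for (i = 0; i < NR_DYNAMIC_STACK_PAGES; i++) {
		pte_t pte = mk_pte(tsk->stack_pages[i], PAGE_KERNEL);

		/* Installed old (A=0) so later usage can be detected. */
		set_pte_at(&init_mm, tsk->stack_va[i], tsk->stack_ptes[i],
			   pte_mkold(pte));
	}
}

/* On schedule-out, give back any page the hardware never marked accessed. */
static void dynamic_stack_check_on_sched_out(struct task_struct *tsk)
{
	int i;

	for (i = 0; i < NR_DYNAMIC_STACK_PAGES; i++) {
		pte_t old = ptep_get(tsk->stack_ptes[i]);

		if (!pte_young(old))
			pte_clear(&init_mm, tsk->stack_va[i], tsk->stack_ptes[i]);
		/* else: the task really used this page; leave it mapped */
	}
}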
> > > > You can't block when scheduling, either. What if you can't refill the pool? > > Why can't we (I am not a scheduler guy)? IRQ's are not yet disabled, > what prevents us from blocking while the old process has not yet been > removed from the CPU? Answering my own question: on a single-CPU machine, the thread that is going away tries to refill the dynamic pages, alloc_page() goes into the slow path, i.e. __alloc_pages_slowpath(), where it performs cond_resched() while waiting for OOM kills, and yet we cannot leave the CPU.
On 3/11/24 16:56, Nadav Amit wrote: > So you can look on the dirty-bit, which is not being set > speculatively and save yourself one problem. Define "set speculatively". :) > If software on one logical processor writes to a page while software > on another logical processor concurrently clears the R/W flag in the > paging-structure entry that maps the page, execution on some > processors may result in the entry’s dirty flag being set (due to the > write on the first logical processor) and the entry’s R/W flag being > clear (due to the update to the entry on the second logical > processor). In other words, you'll see both a fault *AND* the dirty bit. The write never retired and the dirty bit is set. Does that count as being set speculatively? That's just the behavior that the SDM explicitly admits to.
On March 11, 2024 5:53:33 PM PDT, Dave Hansen <dave.hansen@intel.com> wrote: >On 3/11/24 16:56, Nadav Amit wrote: >> So you can look on the dirty-bit, which is not being set >> speculatively and save yourself one problem. >Define "set speculatively". :) > >> If software on one logical processor writes to a page while software >> on another logical processor concurrently clears the R/W flag in the >> paging-structure entry that maps the page, execution on some >> processors may result in the entry’s dirty flag being set (due to the >> write on the first logical processor) and the entry’s R/W flag being >> clear (due to the update to the entry on the second logical >> processor). > >In other words, you'll see both a fault *AND* the dirty bit. The write >never retired and the dirty bit is set. > >Does that count as being set speculatively? > >That's just the behavior that the SDM explicitly admits to. Indeed; both the A and D bits are by design permissive; that is, the hardware can set them at any time. The only guarantees are: 1. The hardware will not set the A bit on a not-present page, nor the D bit on a read-only page. 2. *Provided that the user has invalidated the page entry in the TLB*, hardware guarantees the respective bits will be set before a dependent memory access is made visible. Thus the bits are guaranteed to reflect a strict superset of operations performed architecturally.
On Mon, Mar 11, 2024, at 6:25 PM, H. Peter Anvin wrote: > On March 11, 2024 5:53:33 PM PDT, Dave Hansen <dave.hansen@intel.com> wrote: >>On 3/11/24 16:56, Nadav Amit wrote: >>> So you can look on the dirty-bit, which is not being set >>> speculatively and save yourself one problem. >>Define "set speculatively". :) >> >>> If software on one logical processor writes to a page while software >>> on another logical processor concurrently clears the R/W flag in the >>> paging-structure entry that maps the page, execution on some >>> processors may result in the entry’s dirty flag being set (due to the >>> write on the first logical processor) and the entry’s R/W flag being >>> clear (due to the update to the entry on the second logical >>> processor). >> >>In other words, you'll see both a fault *AND* the dirty bit. The write >>never retired and the dirty bit is set. >> >>Does that count as being set speculatively? >> >>That's just the behavior that the SDM explicitly admits to. > > Indeed; both the A and D bits are by design permissive; that is, the > hardware can set them at any time. > > The only guarantees are: > > 1. The hardware will not set the A bit on a not present late, nor the D > bit on a read only page. Wait a sec. What about setting the D bit on a not-present page? I always assumed that the actual intended purpose of the D bit was for things like file mapping. Imagine an alternate universe in which Linux used hardware dirty tracking instead of relying on do_wp_page, etc. mmap(..., MAP_SHARED): PTE is created, read-write, clean user program may or may not write to the page. Now either munmap is called or the kernel needs to reclaim memory. So the kernel checks if the page is dirty and, if so, writes it back, and then unmaps it. Now some silly people invented SMP, so this needs an atomic operation: xchg the PTE to all-zeros, see if the dirty bit is set, and, if it's set, write back the page. Otherwise discard it. Does this really not work on Intel CPU?
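For reference, the sequence being described in that alternate universe would look roughly like the sketch below (names are invented; this is just the shape of "xchg the PTE, then decide on writeback from the harvested Dirty bit", not an excerpt from any real reclaim path). Whether the hardware guarantees make this safe is exactly what the follow-up addresses:

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <asm/tlbflush.h>

/*
 * Sketch of hardware-dirty-bit-driven writeback for a MAP_SHARED page:
 * atomically tear the PTE down and trust whatever D bit was left behind.
 */
static void reclaim_shared_page(struct vm_area_struct *vma, unsigned long addr,
				pte_t *ptep, struct folio *folio)
{
	/* "xchg the PTE to all-zeros": after this the harvested bits are stable. */
	pte_t old = ptep_get_and_clear(vma->vm_mm, addr, ptep);

	/* No CPU may keep writing through a stale TLB entry. */
	flush_tlb_page(vma, addr);

	if (pte_dirty(old))
		folio_mark_dirty(folio);	/* needs writeback before reuse */

	folio_put(folio);			/* clean copies are simply dropped */
}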
On March 11, 2024 7:16:38 PM PDT, Andy Lutomirski <luto@kernel.org> wrote: >On Mon, Mar 11, 2024, at 6:25 PM, H. Peter Anvin wrote: >> On March 11, 2024 5:53:33 PM PDT, Dave Hansen <dave.hansen@intel.com> wrote: >>>On 3/11/24 16:56, Nadav Amit wrote: >>>> So you can look on the dirty-bit, which is not being set >>>> speculatively and save yourself one problem. >>>Define "set speculatively". :) >>> >>>> If software on one logical processor writes to a page while software >>>> on another logical processor concurrently clears the R/W flag in the >>>> paging-structure entry that maps the page, execution on some >>>> processors may result in the entry’s dirty flag being set (due to the >>>> write on the first logical processor) and the entry’s R/W flag being >>>> clear (due to the update to the entry on the second logical >>>> processor). >>> >>>In other words, you'll see both a fault *AND* the dirty bit. The write >>>never retired and the dirty bit is set. >>> >>>Does that count as being set speculatively? >>> >>>That's just the behavior that the SDM explicitly admits to. >> >> Indeed; both the A and D bits are by design permissive; that is, the >> hardware can set them at any time. >> >> The only guarantees are: >> >> 1. The hardware will not set the A bit on a not present late, nor the D >> bit on a read only page. > >Wait a sec. What about setting the D bit on a not-present page? > >I always assumed that the actual intended purpose of the D bit was for things like file mapping. Imagine an alternate universe in which Linux used hardware dirty tracking instead of relying on do_wp_page, etc. > >mmap(..., MAP_SHARED): PTE is created, read-write, clean > >user program may or may not write to the page. > >Now either munmap is called or the kernel needs to reclaim memory. So the kernel checks if the page is dirty and, if so, writes it back, and then unmaps it. > >Now some silly people invented SMP, so this needs an atomic operation: xchg the PTE to all-zeros, see if the dirty bit is set, and, if itt's set, write back the page. Otherwise discard it. > >Does this really not work on Intel CPU? > Sorry, I should have been more clear. Hardware will not set a bit that would correspond to a prohibited access.
> On 12 Mar 2024, at 2:02, Andy Lutomirski <luto@kernel.org> wrote: > > Doesn't this buy a new problem? Install a page, run the thread without using the page but speculatively load the PTE as read-only into the TLB, context-switch out the thread, (entirely safely and correctly) determine that the page wasn't used, remove it from the PTE, use it for something else and fill it with things that aren't zero, run the thread again, and read from it. Now it has some other thread's data! Yes, you are correct. Bad idea of mine. Regardless of data leak, it opens the door for subtle hard-to-analyze bugs where 2 reads return different values.
Pasha Tatashin <pasha.tatashin@soleen.com> writes: > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index d6375b3c633b..651c558b10eb 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, > if (is_f00f_bug(regs, hw_error_code, address)) > return; > > + if (dynamic_stack_fault(current, address)) > + return; Probably I'm missing something here, but since you are on the same stack as the one you are trying to fix up, how can this possibly work? Fault on missing stack #PF <will push more stuff onto the missing stack causing a double fault> You would need a separate stack just to handle this. But the normal page fault handler couldn't use it because it needs to be able to block. Ah I get it -- you handle it in the double fault handler? So every stack growth will be a #DF too? That's scary. -Andi
On Mon, Mar 11 2024 at 16:46, Pasha Tatashin wrote: > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault) > } > #endif > > + if (dynamic_stack_fault(current, address)) > + return; > + > irqentry_nmi_enter(regs); > instrumentation_begin(); > notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV); > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index d6375b3c633b..651c558b10eb 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, > if (is_f00f_bug(regs, hw_error_code, address)) > return; > > + if (dynamic_stack_fault(current, address)) > + return; T1 schedules out with stack used close to the fault boundary. switch_to(T2) Now T1 schedules back in switch_to(T1) __switch_to_asm() ... switch_stacks() <- SP on T1 stack ! ... ! jmp __switch_to() ! __switch_to() ! ... ! raw_cpu_write(pcpu_hot.current_task, next_p); After switching SP to T1's stack and up to the point where pcpu_hot.current_task (aka current) is updated to T1 a stack fault will invoke dynamic_stack_fault(T2, address) which will return false here: /* check if address is inside the kernel stack area */ stack = (unsigned long)tsk->stack; if (address < stack || address >= stack + THREAD_SIZE) return false; because T2's stack does obviously not cover the faulting address on T1's stack. As a consequence double fault will panic the machine. Thanks, tglx
On Wed, Mar 13, 2024 at 6:23 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > On Mon, Mar 11 2024 at 16:46, Pasha Tatashin wrote: > > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault) > > } > > #endif > > > > + if (dynamic_stack_fault(current, address)) > > + return; > > + > > irqentry_nmi_enter(regs); > > instrumentation_begin(); > > notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV); > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > index d6375b3c633b..651c558b10eb 100644 > > --- a/arch/x86/mm/fault.c > > +++ b/arch/x86/mm/fault.c > > @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, > > if (is_f00f_bug(regs, hw_error_code, address)) > > return; > > > > + if (dynamic_stack_fault(current, address)) > > + return; > > T1 schedules out with stack used close to the fault boundary. > > switch_to(T2) > > Now T1 schedules back in > > switch_to(T1) > __switch_to_asm() > ... > switch_stacks() <- SP on T1 stack > ! ... > ! jmp __switch_to() > ! __switch_to() > ! ... > ! raw_cpu_write(pcpu_hot.current_task, next_p); > > After switching SP to T1's stack and up to the point where > pcpu_hot.current_task (aka current) is updated to T1 a stack fault will > invoke dynamic_stack_fault(T2, address) which will return false here: > > /* check if address is inside the kernel stack area */ > stack = (unsigned long)tsk->stack; > if (address < stack || address >= stack + THREAD_SIZE) > return false; > > because T2's stack does obviously not cover the faulting address on T1's > stack. As a consequence double fault will panic the machine. Hi Thomas, Thank you, you are absolutely right, we can't trust "current" in the fault handler. We can change dynamic_stack_fault() to only accept fault_address as an argument, and let it determine the right task_struct pointer internally. Here's a potential solution that is fast, avoids locking, and ensures atomicity: 1. Kernel Stack VA Space Dedicate a virtual address range ([KSTACK_START_VA - KSTACK_END_VA]) exclusively for kernel stacks. This simplifies validation of faulting addresses to be part of a stack. 2. Finding the faulting task - Use ALIGN(fault_address, THREAD_SIZE) to calculate the end of the topmost stack page (since stack addresses are aligned to THREAD_SIZE). - Store the task_struct pointer as the last word on this topmost page, that is always present as it is a pre-allocated stack page. 3. Stack Padding Increase padding to 8 bytes on x86_64 (TOP_OF_KERNEL_STACK_PADDING 8) to accommodate the task_struct pointer. Another issue that this race brings is that 3 pages per CPU might not be enough; we might need up to 6 pages: 3 to cover the going-away task, and 3 to cover the new task. Pasha
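As a sketch of step 2, the lookup reduces to a little arithmetic; KSTACK_START_VA/KSTACK_END_VA and the helper name are hypothetical, as is the exact placement of the stored pointer:

#include <linux/align.h>
#include <linux/sched.h>

/*
 * Sketch only: map a faulting stack address back to its owning task without
 * touching 'current'. Assumes THREAD_SIZE-aligned stacks in a dedicated VA
 * range, with a task_struct pointer stored in the last word of the top
 * (always-mapped) stack page.
 */
static struct task_struct *dynamic_stack_addr_to_task(unsigned long fault_address)
{
	unsigned long stack_top, slot;

	if (fault_address < KSTACK_START_VA || fault_address >= KSTACK_END_VA)
		return NULL;			/* not a kernel stack address */

	/* Round up to the top of the THREAD_SIZE-aligned stack containing it. */
	stack_top = ALIGN(fault_address + 1, THREAD_SIZE);

	/* Last word of the topmost, pre-allocated stack page. */
	slot = stack_top - sizeof(struct task_struct *);

	return *(struct task_struct **)slot;
}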
On Wed, Mar 13, 2024 at 9:43 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > > On Wed, Mar 13, 2024 at 6:23 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > On Mon, Mar 11 2024 at 16:46, Pasha Tatashin wrote: > > > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault) > > > } > > > #endif > > > > > > + if (dynamic_stack_fault(current, address)) > > > + return; > > > + > > > irqentry_nmi_enter(regs); > > > instrumentation_begin(); > > > notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV); > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > > index d6375b3c633b..651c558b10eb 100644 > > > --- a/arch/x86/mm/fault.c > > > +++ b/arch/x86/mm/fault.c > > > @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, > > > if (is_f00f_bug(regs, hw_error_code, address)) > > > return; > > > > > > + if (dynamic_stack_fault(current, address)) > > > + return; > > > > T1 schedules out with stack used close to the fault boundary. > > > > switch_to(T2) > > > > Now T1 schedules back in > > > > switch_to(T1) > > __switch_to_asm() > > ... > > switch_stacks() <- SP on T1 stack > > ! ... > > ! jmp __switch_to() > > ! __switch_to() > > ! ... > > ! raw_cpu_write(pcpu_hot.current_task, next_p); > > > > After switching SP to T1's stack and up to the point where > > pcpu_hot.current_task (aka current) is updated to T1 a stack fault will > > invoke dynamic_stack_fault(T2, address) which will return false here: > > > > /* check if address is inside the kernel stack area */ > > stack = (unsigned long)tsk->stack; > > if (address < stack || address >= stack + THREAD_SIZE) > > return false; > > > > because T2's stack does obviously not cover the faulting address on T1's > > stack. As a consequence double fault will panic the machine. > > Hi Thomas, > > Thank you, you are absolutely right, we can't trust "current" in the > fault handler. > > We can change dynamic_stack_fault() to only accept fault_address as an > argument, and let it determine the right task_struct pointer > internally. > > Let's modify dynamic_stack_fault() to accept only the fault_address. > It can then determine the correct task_struct pointer internally. > > Here's a potential solution that is fast, avoids locking, and ensures atomicity: > > 1. Kernel Stack VA Space > Dedicate a virtual address range ([KSTACK_START_VA - KSTACK_END_VA]) > exclusively for kernel stacks. This simplifies validation of faulting > addresses to be part of a stack. > > 2. Finding the faulty task > - Use ALIGN(fault_address, THREAD_SIZE) to calculate the end of the > topmost stack page (since stack addresses are aligned to THREAD_SIZE). > - Store the task_struct pointer as the last word on this topmost page, > that is always present as it is a pre-allcated stack page. > > 3. Stack Padding > Increase padding to 8 bytes on x86_64 (TOP_OF_KERNEL_STACK_PADDING 8) > to accommodate the task_struct pointer. Alternatively, do not even look-up the task_struct in dynamic_stack_fault(), but only install the mapping to the faulting address, store va in the per-cpu array, and handle the rest in dynamic_stack() during context switching. At that time spin locks can be taken, and we can do a find_vm_area(addr) call. This way, we would not need to modify TOP_OF_KERNEL_STACK_PADDING to keep task_struct in there. Pasha
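That deferred variant might look roughly like this; the per-CPU array, the helper names, and calling it dynamic_stack() from the context-switch path are all assumptions restating the idea above, not code from the series:

#include <linux/percpu.h>
#include <linux/vmalloc.h>

/*
 * Sketch only: the fault handler just records which stack address was
 * populated; the lock-taking bookkeeping runs later, at context switch.
 */
struct dynamic_stack_pending {
	unsigned long addr[3];
};

static DEFINE_PER_CPU(struct dynamic_stack_pending, dynamic_stack_pending);

/* Called from the (double) fault handler: stores only, no locks. */
static void dynamic_stack_record_fault(unsigned long addr)
{
	struct dynamic_stack_pending *p = this_cpu_ptr(&dynamic_stack_pending);
	int i;

	for (i = 0; i < ARRAY_SIZE(p->addr); i++) {
		if (!p->addr[i]) {
			p->addr[i] = addr;
			return;
		}
	}
}

/* Called while switching tasks, where taking spinlocks is fine. */
static void dynamic_stack(void)
{
	struct dynamic_stack_pending *p = this_cpu_ptr(&dynamic_stack_pending);
	int i;

	for (i = 0; i < ARRAY_SIZE(p->addr); i++) {
		if (p->addr[i]) {
			struct vm_struct *vm = find_vm_area((void *)p->addr[i]);

			if (vm) {
				/* vm identifies the stack; charge it and refill the pool */
			}
			p->addr[i] = 0;
		}
	}
}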
On Wed, Mar 13 2024 at 11:28, Pasha Tatashin wrote: > On Wed, Mar 13, 2024 at 9:43 AM Pasha Tatashin > <pasha.tatashin@soleen.com> wrote: >> Here's a potential solution that is fast, avoids locking, and ensures atomicity: >> >> 1. Kernel Stack VA Space >> Dedicate a virtual address range ([KSTACK_START_VA - KSTACK_END_VA]) >> exclusively for kernel stacks. This simplifies validation of faulting >> addresses to be part of a stack. >> >> 2. Finding the faulty task >> - Use ALIGN(fault_address, THREAD_SIZE) to calculate the end of the >> topmost stack page (since stack addresses are aligned to THREAD_SIZE). >> - Store the task_struct pointer as the last word on this topmost page, >> that is always present as it is a pre-allcated stack page. >> >> 3. Stack Padding >> Increase padding to 8 bytes on x86_64 (TOP_OF_KERNEL_STACK_PADDING 8) >> to accommodate the task_struct pointer. > > Alternatively, do not even look-up the task_struct in > dynamic_stack_fault(), but only install the mapping to the faulting > address, store va in the per-cpu array, and handle the rest in > dynamic_stack() during context switching. At that time spin locks can > be taken, and we can do a find_vm_area(addr) call. > > This way, we would not need to modify TOP_OF_KERNEL_STACK_PADDING to > keep task_struct in there. Why not simply doing the 'current' update right next to the stack switching in __switch_to_asm() which has no way of faulting. That needs to validate whether anything uses current between the stack switch and the place where current is updated today. I think nothing should do so, but I would not be surprised either if it would be the case. Such code would already today just work by chance I think, That should not be hard to analyze and fixup if necessary. So that's fixable, but I'm not really convinced that all of this is safe and correct under all circumstances. That needs a lot more analysis than just the trivial one I did for switch_to(). Thanks, tglx
On Wed, Mar 13, 2024 at 12:12 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > On Wed, Mar 13 2024 at 11:28, Pasha Tatashin wrote: > > On Wed, Mar 13, 2024 at 9:43 AM Pasha Tatashin > > <pasha.tatashin@soleen.com> wrote: > >> Here's a potential solution that is fast, avoids locking, and ensures atomicity: > >> > >> 1. Kernel Stack VA Space > >> Dedicate a virtual address range ([KSTACK_START_VA - KSTACK_END_VA]) > >> exclusively for kernel stacks. This simplifies validation of faulting > >> addresses to be part of a stack. > >> > >> 2. Finding the faulty task > >> - Use ALIGN(fault_address, THREAD_SIZE) to calculate the end of the > >> topmost stack page (since stack addresses are aligned to THREAD_SIZE). > >> - Store the task_struct pointer as the last word on this topmost page, > >> that is always present as it is a pre-allcated stack page. > >> > >> 3. Stack Padding > >> Increase padding to 8 bytes on x86_64 (TOP_OF_KERNEL_STACK_PADDING 8) > >> to accommodate the task_struct pointer. > > > > Alternatively, do not even look-up the task_struct in > > dynamic_stack_fault(), but only install the mapping to the faulting > > address, store va in the per-cpu array, and handle the rest in > > dynamic_stack() during context switching. At that time spin locks can > > be taken, and we can do a find_vm_area(addr) call. > > > > This way, we would not need to modify TOP_OF_KERNEL_STACK_PADDING to > > keep task_struct in there. > > Why not simply doing the 'current' update right next to the stack > switching in __switch_to_asm() which has no way of faulting. > > That needs to validate whether anything uses current between the stack > switch and the place where current is updated today. I think nothing > should do so, but I would not be surprised either if it would be the > case. Such code would already today just work by chance I think, > > That should not be hard to analyze and fixup if necessary. > > So that's fixable, but I'm not really convinced that all of this is safe > and correct under all circumstances. That needs a lot more analysis than > just the trivial one I did for switch_to(). Agreed, if the current task pointer can be switched later, after loads and stores to the stack, that would be a better solution. I will incorporate this approach into my next version. I also concur that this proposal necessitates more rigorous analysis. This work remains in the investigative phase, where I am seeking a viable solution to the problem. The core issue is that kernel stacks consume excessive memory for certain workloads. However, we cannot simply reduce their size, as this leads to machine crashes in the infrequent instances where stacks do run deep. Thanks, Pasha
On Thu, Mar 14 2024 at 10:03, Pasha Tatashin wrote: > On Wed, Mar 13, 2024 at 12:12 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> That needs to validate whether anything uses current between the stack >> switch and the place where current is updated today. I think nothing >> should do so, but I would not be surprised either if it would be the >> case. Such code would already today just work by chance I think, >> >> That should not be hard to analyze and fixup if necessary. >> >> So that's fixable, but I'm not really convinced that all of this is safe >> and correct under all circumstances. That needs a lot more analysis than >> just the trivial one I did for switch_to(). > > Agreed, if the current task pointer can be switched later, after loads > and stores to the stack, that would be a better solution. I will > incorporate this approach into my next version. No. You need to ensure that there is neither a load nor a store on the stack between:

  movq %rsp, TASK_threadsp(%rdi)
  movq TASK_threadsp(%rsi), %rsp

and update_current(). IOW, you need to move the update of pcpu_hot.current to ASM right after the RSP switch. > I also concur that this proposal necessitates more rigorous analysis. Glad we agree here :) Thanks, tglx
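In entry-asm terms, the change Thomas describes would amount to something like the following sketch in the style of arch/x86/entry/entry_64.S; the X86_current_task offset for pcpu_hot.current_task is assumed here and would have to come from asm-offsets, so this is an illustration of the idea, not a tested patch:

	/* __switch_to_asm(): %rdi = prev, %rsi = next */
	movq	%rsp, TASK_threadsp(%rdi)	/* save prev's stack pointer */
	movq	TASK_threadsp(%rsi), %rsp	/* switch to next's stack */

	/*
	 * Hypothetical: publish 'next' as current immediately, before any
	 * instruction that may touch the new stack, so that a stack fault
	 * taken from here on already sees the right task.
	 */
	movq	%rsi, PER_CPU_VAR(pcpu_hot + X86_current_task)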
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5edec175b9bf..9bb0da3110fa 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -197,6 +197,7 @@ config X86
 	select HAVE_ARCH_USERFAULTFD_WP		if X86_64 && USERFAULTFD
 	select HAVE_ARCH_USERFAULTFD_MINOR	if X86_64 && USERFAULTFD
 	select HAVE_ARCH_VMAP_STACK		if X86_64
+	select HAVE_ARCH_DYNAMIC_STACK		if X86_64
 	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_ASM_MODVERSIONS
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c3b2f863acf0..cc05401e729f 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 	}
 #endif
 
+	if (dynamic_stack_fault(current, address))
+		return;
+
 	irqentry_nmi_enter(regs);
 	instrumentation_begin();
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d6375b3c633b..651c558b10eb 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	if (is_f00f_bug(regs, hw_error_code, address))
 		return;
 
+	if (dynamic_stack_fault(current, address))
+		return;
+
 	/* Was the fault spurious, caused by lazy TLB invalidation? */
 	if (spurious_kernel_fault(hw_error_code, address))
 		return;
Add dynamic_stack_fault() calls to the kernel fault handlers, and also declare HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be enabled on the x86 architecture.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 arch/x86/Kconfig        | 1 +
 arch/x86/kernel/traps.c | 3 +++
 arch/x86/mm/fault.c     | 3 +++
 3 files changed, 7 insertions(+)
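The body of dynamic_stack_fault() itself lives in an earlier patch of the series; purely as an illustration of the shape discussed in this thread (per-CPU pool, check the address against the task's stack, map a single page), a sketch might look like the following. dynamic_stack_take_page() refers to the pool sketch earlier, dynamic_stack_map_page() is a hypothetical placeholder, and, as the discussion shows, trusting the passed-in task during a context switch is itself problematic:

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/vmalloc.h>

/*
 * Illustrative sketch only -- not the implementation from the series.
 * Returns true if the fault was a missing dynamic stack page and a page
 * was successfully mapped, so the faulting context can be resumed.
 */
bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address)
{
	unsigned long stack = (unsigned long)tsk->stack;
	struct page *page;

	/* check if address is inside the kernel stack area */
	if (address < stack || address >= stack + THREAD_SIZE)
		return false;

	/* Already backed by a page? Then this was some other kind of fault. */
	if (vmalloc_to_page((const void *)address))
		return false;

	page = dynamic_stack_take_page();	/* per-CPU pool, never blocks */
	if (!page)
		return false;

	/*
	 * Map the page at the faulting address; a real implementation would
	 * walk the kernel page tables of the vmalloc-backed stack here
	 * (dynamic_stack_map_page() stands in for that).
	 */
	dynamic_stack_map_page(ALIGN_DOWN(address, PAGE_SIZE), page);
	return true;
}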