Message ID: 20240531090331.13713-1-weijiang.yang@intel.com (mailing list archive)
Series: Introduce CET supervisor state support
Hi, maintainers,

I recently did some tests for this series. The benchmark is here:

  https://github.com/antonblanchard/will-it-scale/blob/master/tests/context_switch1.c

The purpose is to check the overall performance impact on thread/process
context switch when the CET supervisor state bit, i.e., RFBM[12], is fixed
at 0 or 1 in xsaves/xrstors.

3 cases are tested:

case 1: stock v6.10-rc1 kernel.
case 2: v6.10-rc1 kernel + this patch series, so that RFBM[12] == 0 for a
        normal thread/process.
case 3: v6.10-rc1 kernel + this patch series with patch 3 (Introduce
        XFEATURE_MASK_KERNEL_DYNAMIC xfeature set) reverted, so that
        RFBM[12] == 1 for a normal thread/process.

Run the command below 10 times in each case:

  ./context_switch1_processes -s 20 -t 50 -n

Trim the results by removing the top and bottom 10% of the data in each
case, and I got the numbers below:

case 1: 16444675.45    set as 1 (baseline)
case 2: 16412285.61    ~99.80%
case 3: 16405716.19    ~99.76%

As you can see from the results, in case 2 with the optimization based on
XFEATURE_MASK_KERNEL_DYNAMIC, the performance gain is ~0.2%. So I'm not
sure whether XFEATURE_MASK_KERNEL_DYNAMIC and the related changes are
worth it for this series.

Could you share your thoughts? Thanks a lot!
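For reference, the trimming step above can be reproduced with a small
user-space helper along the lines of the sketch below. This is not part of
the series, and the exact reduction is an assumption: with 10 runs,
dropping the top and bottom 10% means discarding the single highest and
lowest result and averaging the remaining 8.

/* trimmed_mean.c: average benchmark results after dropping the top and
 * bottom 10% -- with 10 runs that is just the min and the max. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
	double d = *(const double *)a - *(const double *)b;
	return (d > 0) - (d < 0);
}

int main(int argc, char **argv)
{
	int n = argc - 1, drop, kept = 0;
	double vals[64], sum = 0.0;

	if (n < 3 || n > 64)
		return 1;
	for (int i = 0; i < n; i++)
		vals[i] = strtod(argv[i + 1], NULL);
	qsort(vals, n, sizeof(vals[0]), cmp_double);

	drop = n / 10;		/* drop 10% off each end, at least one */
	if (drop == 0)
		drop = 1;
	for (int i = drop; i < n - drop; i++, kept++)
		sum += vals[i];
	printf("trimmed mean of %d runs: %.2f\n", kept, sum / kept);
	return 0;
}

Usage would be e.g. "./trimmed_mean $(cat case2_runs.txt)" with one
throughput number per run.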
On 7/8/24 20:17, Yang, Weijiang wrote:
> So I'm not sure whether XFEATURE_MASK_KERNEL_DYNAMIC and the related
> changes are worth it for this series.
>
> Could you share your thoughts?

First of all, I really do appreciate when folks make the effort to _try_
to draw their own conclusions before asking the maintainers to share
theirs. Next time, OK? ;)

But here goes. So we've basically got three cases. Here's a fancy table:

> https://docs.google.com/spreadsheets/d/e/2PACX-1vROHIgrtHzUJmdlzT7D7tuVzgM8AMlK2XlorvFIJvk-I0NjD7A-T_qntjz7cUJlCScfWGtSfPK30Xtu/pubhtml

... and the same in ASCII:

 Case | IA32_XSS[12] | Space | RFBM[12] | Drop%
------+--------------+-------+----------+------
   1  |      0       | None  |    0     |  0.0%
   2  |      1       | None  |    0     |  0.2%
   3  |      1       | 24B?  |    1     |  0.2%

Case 1 is the baseline of course. Case 2 avoids allocating space for CET
and also leans on the kernel to set RFBM[12]==0 and tell the hardware not
to write CET-S state. Case 3 wastes the CET-S space in each task and also
leans on the hardware init optimization to avoid writing out CET-S space
on each XSAVES.

#1 is: 0 lines of code.
#2 is: 5 files changed, 90 insertions(+), 27 deletions(-)
#3 is: very few lines of code, nearing zero

#2 and #3 have the same performance. So we're down to choosing between

 * $BYTES space in 'struct fpu' (on hardware supporting CET-S)

or

 * ~100 loc

$BYTES is 24, right? Did I get anything wrong?

So, here's my stake in the ground: I think the 100 lines of code is
probably worth it. But I also hate complicating the FPU code, so I'm also
somewhat drawn to just eating the 24 bytes and moving on.

But I'm still in the "case 2" camp. Anybody disagree?
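For anyone skimming, the RFBM[12] column maps onto the architecture like
this (an illustrative sketch, not code from the series; the helper name,
macro name and the example masks are made up): XSAVES computes RFBM as the
EDX:EAX instruction mask ANDed with (XCR0 | IA32_XSS), so case 2 keeps
IA32_XSS[12] set but clears bit 12 from the mask requested for ordinary
tasks, while case 3 leaves it set and relies on the init optimization.

#include <stdint.h>
#include <stdio.h>

/* Stand-in name for xstate bit 12, the CET supervisor (CET-S) state. */
#define XFEATURE_MASK_CET_SUPERVISOR	(1ULL << 12)

/*
 * XSAVES: RFBM = (EDX:EAX instruction mask) & (XCR0 | IA32_XSS).
 * Case 2: the kernel drops bit 12 from the mask it requests for a normal
 *         task, so RFBM[12] == 0 and the CET-S area is never touched.
 * Case 3: bit 12 stays in the request, RFBM[12] == 1, and we rely on the
 *         init optimization to skip writing the (init-state) CET-S data.
 */
static uint64_t rfbm_for_task(uint64_t requested, uint64_t xcr0,
			      uint64_t ia32_xss, int want_cet_s)
{
	if (!want_cet_s)
		requested &= ~XFEATURE_MASK_CET_SUPERVISOR;	/* case 2 */
	return requested & (xcr0 | ia32_xss);
}

int main(void)
{
	/* Made-up enable masks: x87/SSE/AVX in XCR0, CET-S in IA32_XSS. */
	uint64_t xcr0 = 0x7, xss = XFEATURE_MASK_CET_SUPERVISOR;
	uint64_t req  = xcr0 | xss;

	printf("case 2 RFBM: %#llx\n",
	       (unsigned long long)rfbm_for_task(req, xcr0, xss, 0));
	printf("case 3 RFBM: %#llx\n",
	       (unsigned long long)rfbm_for_task(req, xcr0, xss, 1));
	return 0;
}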
On Thu, 2024-07-11 at 13:58 -0700, Dave Hansen wrote:
> So we're down to choosing between
>
>  * $BYTES space in 'struct fpu' (on hardware supporting CET-S)
>
> or
>
>  * ~100 loc
>
> $BYTES is 24, right? Did I get anything wrong?

Do we know what the actual memory use is? It would increase the size asked
of the allocator by 24 bytes, but what amount of memory actually gets
reserved?

It is sometimes a slab-allocated buffer, and sometimes a vmalloc, right?
I'm not sure about slab sizes, but for vmalloc, if the increase doesn't
cross a page boundary, it will be the same size allocation in reality. Or
if it is already close to a page boundary, it might use a whole extra 4096
bytes.

So we might be looking at a situation where some tasks get an entire extra
page allocated per task, and some see no difference, and only the average
is a 24-byte increase.

Hmm, I was trying to reason out something to use for tie breaking. Not
sure how successful it was. The worst case of a whole extra page sounds
like something someone might care about?
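To put numbers on the rounding effect, here is a throwaway user-space
calculation (the example buffer sizes are made up, not real fpstate
sizes): the +24 bytes only costs real memory when it pushes the
allocation across a 4096-byte boundary, and then it costs a whole page.

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	/* Made-up buffer sizes, just to show the boundary effect. */
	unsigned long sizes[] = { 2440UL, 4080UL, 11008UL };

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		unsigned long before = ALIGN_UP(sizes[i], PAGE_SIZE);
		unsigned long after  = ALIGN_UP(sizes[i] + 24, PAGE_SIZE);

		printf("%5lu bytes: alloc %lu -> %lu (%+ld)\n",
		       sizes[i], before, after, (long)(after - before));
	}
	return 0;
}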
On 7/11/24 15:11, Edgecombe, Rick P wrote:
> On Thu, 2024-07-11 at 13:58 -0700, Dave Hansen wrote:
>> So we're down to choosing between
>>
>>  * $BYTES space in 'struct fpu' (on hardware supporting CET-S)
>>
>> or
>>
>>  * ~100 loc
>>
>> $BYTES is 24, right? Did I get anything wrong?
>
> Do we know what the actual memory use is? It would increase the size
> asked of the allocator by 24 bytes, but what amount of memory actually
> gets reserved?
>
> It is sometimes a slab-allocated buffer, and sometimes a vmalloc, right?
> I'm not sure about slab sizes, but for vmalloc, if the increase doesn't
> cross a page boundary, it will be the same size allocation in reality.
> Or if it is already close to a page boundary, it might use a whole extra
> 4096 bytes.

Man, I hope I don't have this all mixed up in my head. Wouldn't be the
first time. I _think_ you might be confusing thread_info and
thread_struct, though. I know I've gotten them confused before.

But we get to the 'struct fpu' via:

	current->thread.fpu

where current is a 'task_struct' which is in /proc/slabinfo, and 'struct
thread_struct thread' and 'struct fpu' are embedded in 'task_struct', not
allocated on their own:

	task_struct    2958   3018   10048    3    8 ...

So my current task_struct is 10048 bytes and 3 of them fit in each 8-page
slab, leaving 2624 bytes to spare.

I don't think we're too dainty about adding things to task_struct. Are we?

> So we might be looking at a situation where some tasks get an entire
> extra page allocated per task, and some see no difference, and only the
> average is a 24-byte increase.

I think you're right here, at least when it comes to large, weirdly-sized
slabs. But _so_ many things affect task_struct that I've never seen anyone
sweat it too much.
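A quick throwaway check of that slabinfo arithmetic (slab order and object
size hard-coded from the line above): 3 objects fit, 2624 bytes are spare,
and a 24-byte bump would not change objects-per-slab in this particular
configuration.

#include <stdio.h>

int main(void)
{
	unsigned long slab_bytes = 8UL * 4096;	/* 8 pages per slab      */
	unsigned long objsize    = 10048;	/* from /proc/slabinfo   */

	/* 3 objects fit: 32768 - 3*10048 = 2624 bytes left over. */
	printf("objs/slab: %lu, spare: %lu bytes\n",
	       slab_bytes / objsize, slab_bytes % objsize);

	/* With 24 more bytes per task_struct, still 3 objects per slab. */
	printf("objs/slab after +24: %lu\n", slab_bytes / (objsize + 24));
	return 0;
}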
On Thu, 2024-07-11 at 15:30 -0700, Dave Hansen wrote:
>>> $BYTES is 24, right? Did I get anything wrong?
>>
>> Do we know what the actual memory use is? It would increase the size
>> asked of the allocator by 24 bytes, but what amount of memory actually
>> gets reserved?
>>
>> It is sometimes a slab-allocated buffer, and sometimes a vmalloc,
>> right? I'm not sure about slab sizes, but for vmalloc, if the increase
>> doesn't cross a page boundary, it will be the same size allocation in
>> reality. Or if it is already close to a page boundary, it might use a
>> whole extra 4096 bytes.
>
> Man, I hope I don't have this all mixed up in my head. Wouldn't be the
> first time. I _think_ you might be confusing thread_info and
> thread_struct, though. I know I've gotten them confused before.
>
> But we get to the 'struct fpu' via:
>
>	current->thread.fpu
>
> where current is a 'task_struct' which is in /proc/slabinfo, and 'struct
> thread_struct thread' and 'struct fpu' are embedded in 'task_struct',
> not allocated on their own:

I think thread_struct is always slab-allocated, but the
current->thread.fpu.fpstate pointer can be reallocated to point to a
vmalloc'd buffer in fpstate_realloc(), in the case of XFD features.

>	task_struct    2958   3018   10048    3    8 ...
>
> So my current task_struct is 10048 bytes and 3 of them fit in each
> 8-page slab, leaving 2624 bytes to spare.
>
> I don't think we're too dainty about adding things to task_struct.
> Are we?

So for you, unconditionally adding 24 bytes to the xstate would actually
not use any extra memory. But, yes, it all could change for a number of
reasons.

>> So we might be looking at a situation where some tasks get an entire
>> extra page allocated per task, and some see no difference, and only
>> the average is a 24-byte increase.
>
> I think you're right here, at least when it comes to large, weirdly-sized
> slabs. But _so_ many things affect task_struct that I've never seen
> anyone sweat it too much.

Makes sense. Then I can't think of any argument to move away from case 2.
On 7/12/2024 4:58 AM, Dave Hansen wrote:
> On 7/8/24 20:17, Yang, Weijiang wrote:
>> So I'm not sure whether XFEATURE_MASK_KERNEL_DYNAMIC and the related
>> changes are worth it for this series.
>>
>> Could you share your thoughts?
>
> First of all, I really do appreciate when folks make the effort to _try_
> to draw their own conclusions before asking the maintainers to share
> theirs. Next time, OK? ;)

Hi, Dave,

Sorry for not doing that, and thanks for making the conclusion clear!

I personally prefer applying the whole series so as to eliminate the
storage space issue and let the guest FPU config use its own settings.
But I'm also not sure the changes are worthwhile from the kernel's point
of view.
On 7/11/24 15:55, Edgecombe, Rick P wrote:
>> where current is a 'task_struct' which is in /proc/slabinfo, and 'struct
>> thread_struct thread' and 'struct fpu' are embedded in 'task_struct',
>> not allocated on their own:
>
> I think thread_struct is always slab-allocated, but the
> current->thread.fpu.fpstate pointer can be reallocated to point to a
> vmalloc'd buffer in fpstate_realloc(), in the case of XFD features.

Good point. I was ignoring XFD and AMX. They're super rare and
(conditionally) add another 8k -- well, closer to 11k since we duplicate
some of the XSAVE area. But honestly, even if the AMX 'struct fpu' fit
perfectly into 4k*3 pages and CET-S made it take 4k*4 pages, I'm not sure
I'd even care. It would only affect AMX-using apps on AMX-capable
hardware. So a small minority of tasks on a small minority of one x86
vendor's CPUs.

The (potential) space consumption from the inline task_struct fpu will
matter a lot more across all Linux users than AMX ever will. It would
affect all tasks on all CPUs that have CET-S, which will hopefully be the
majority of x86 CPUs running Linux some day.