Message ID | 20240201171159.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | regset: use vmalloc() for regset_get_alloc() | expand |
On Thu, Feb 01, 2024 at 05:12:03PM -0800, Douglas Anderson wrote: > While browsing through ChromeOS crash reports, I found one with an > allocation failure that looked like this: > An order 7 allocation is (1 << 7) contiguous pages, or 512K. It's not > a surprise that this allocation failed on a system that's been running > for a while. > if (size > regset->n * regset->size) > size = regset->n * regset->size; > if (!p) { > - to_free = p = kzalloc(size, GFP_KERNEL); > + to_free = p = vmalloc(size); What the hell? Which regset could have lead to that? It would need to have the total size of register in excess of 256K. Seriously, which regset is that about? Note that we have just made sure that size is not greater than that product. size is unsigned int, so it's not as if a negative value passed to function could get through that test only to be interpreted as large positive later... Details, please.
On Thu, Feb 01, 2024 at 05:12:03PM -0800, Douglas Anderson wrote: > While browsing through ChromeOS crash reports, I found one with an > allocation failure that looked like this: > > chrome: page allocation failure: order:7, > mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), > nodemask=(null),cpuset=urgent,mems_allowed=0 That does seem bad ... > @@ -16,14 +17,14 @@ static int __regset_get(struct task_struct *target, > if (size > regset->n * regset->size) > size = regset->n * regset->size; > if (!p) { > - to_free = p = kzalloc(size, GFP_KERNEL); > + to_free = p = vmalloc(size); It's my impression that sometimes this size might be relatively small? Perhaps we should make this kvmalloc so that we can satisfy it from the slab allocator if it is small? Also, I assume that we don't rely on this memory being physically contiguous; we don't, for example, do I/O on it?
Hi, On Thu, Feb 1, 2024 at 5:22 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Thu, Feb 01, 2024 at 05:12:03PM -0800, Douglas Anderson wrote: > > While browsing through ChromeOS crash reports, I found one with an > > allocation failure that looked like this: > > > An order 7 allocation is (1 << 7) contiguous pages, or 512K. It's not > > a surprise that this allocation failed on a system that's been running > > for a while. > > > if (size > regset->n * regset->size) > > size = regset->n * regset->size; > > if (!p) { > > - to_free = p = kzalloc(size, GFP_KERNEL); > > + to_free = p = vmalloc(size); > > What the hell? Which regset could have lead to that? > It would need to have the total size of register in excess of > 256K. Seriously, which regset is that about? Note that we > have just made sure that size is not greater than that product. > size is unsigned int, so it's not as if a negative value passed > to function could get through that test only to be interpreted > as large positive later... > > Details, please. I can continue to dig more, but it is easy for me to reproduce this. On the stack is elf_core_dump() and it seems like we're getting a core dump of the chrome process. So I just arbitrarily look for the chrome GPU process: $ ps aux | grep gpu-process chronos 2075 3.0 1.1 34075552 95372 ? S<l 18:44 0:01 /opt/google/chrome/chrome --type=gpu-process ... Then I send it a quit: $ kill -quit 2075 I added some printouts for this allocation and there are a ton. Here's all of them, some of which are over 256K: [ 66.677393] DOUG: Allocating 272 bytes [ 66.688994] DOUG: Allocating 272 bytes [ 66.692921] DOUG: Allocating 528 bytes [ 66.696799] DOUG: Allocating 8 bytes [ 66.701058] DOUG: Allocating 272 bytes [ 66.704988] DOUG: Allocating 528 bytes [ 66.708875] DOUG: Allocating 8 bytes [ 66.712929] DOUG: Allocating 272 bytes [ 66.716845] DOUG: Allocating 528 bytes [ 66.720721] DOUG: Allocating 8 bytes [ 66.724752] DOUG: Allocating 272 bytes [ 66.728719] DOUG: Allocating 528 bytes [ 66.732621] DOUG: Allocating 8 bytes [ 66.736615] DOUG: Allocating 272 bytes [ 66.740584] DOUG: Allocating 528 bytes [ 66.744483] DOUG: Allocating 8 bytes [ 66.748507] DOUG: Allocating 272 bytes [ 66.752412] DOUG: Allocating 528 bytes [ 66.756328] DOUG: Allocating 8 bytes [ 66.760382] DOUG: Allocating 272 bytes [ 66.764356] DOUG: Allocating 528 bytes [ 66.768275] DOUG: Allocating 8 bytes [ 66.772236] DOUG: Allocating 272 bytes [ 66.776135] DOUG: Allocating 528 bytes [ 66.780013] DOUG: Allocating 8 bytes [ 66.787244] DOUG: Allocating 272 bytes [ 66.791175] DOUG: Allocating 528 bytes [ 66.795056] DOUG: Allocating 8 bytes [ 66.799101] DOUG: Allocating 272 bytes [ 66.803007] DOUG: Allocating 528 bytes [ 66.806930] DOUG: Allocating 8 bytes [ 66.810775] DOUG: Allocating 272 bytes [ 66.814668] DOUG: Allocating 528 bytes [ 66.818544] DOUG: Allocating 8 bytes [ 66.822409] DOUG: Allocating 272 bytes [ 66.826328] DOUG: Allocating 528 bytes [ 66.830258] DOUG: Allocating 8 bytes [ 66.834331] DOUG: Allocating 272 bytes [ 66.838510] DOUG: Allocating 528 bytes [ 66.842399] DOUG: Allocating 8 bytes [ 66.846301] DOUG: Allocating 272 bytes [ 66.850181] DOUG: Allocating 528 bytes [ 66.854051] DOUG: Allocating 8 bytes [ 66.857864] DOUG: Allocating 272 bytes [ 66.861745] DOUG: Allocating 528 bytes [ 66.865621] DOUG: Allocating 8 bytes [ 66.869495] DOUG: Allocating 272 bytes [ 66.873384] DOUG: Allocating 528 bytes [ 66.877261] DOUG: Allocating 8 bytes [ 66.892077] DOUG: Allocating 528 bytes [ 66.895978] DOUG: Allocating 16 bytes [ 66.899760] DOUG: Allocating 264 bytes [ 66.903624] DOUG: Allocating 264 bytes [ 66.907489] DOUG: Allocating 4 bytes [ 66.911184] DOUG: Allocating 279584 bytes [ 66.915392] DOUG: Allocating 8768 bytes [ 66.919354] DOUG: Allocating 65552 bytes [ 66.923415] DOUG: Allocating 64 bytes [ 66.927190] DOUG: Allocating 16 bytes [ 66.930968] DOUG: Allocating 8 bytes [ 66.934649] DOUG: Allocating 8 bytes [ 66.938332] DOUG: Allocating 528 bytes [ 66.942199] DOUG: Allocating 16 bytes [ 66.945970] DOUG: Allocating 264 bytes [ 66.949832] DOUG: Allocating 264 bytes [ 66.953702] DOUG: Allocating 4 bytes [ 66.957385] DOUG: Allocating 279584 bytes [ 66.961605] DOUG: Allocating 8768 bytes [ 66.965574] DOUG: Allocating 65552 bytes [ 66.969632] DOUG: Allocating 64 bytes [ 66.973405] DOUG: Allocating 16 bytes [ 66.977179] DOUG: Allocating 8 bytes [ 66.980862] DOUG: Allocating 8 bytes [ 66.984553] DOUG: Allocating 528 bytes [ 66.988416] DOUG: Allocating 16 bytes [ 66.992191] DOUG: Allocating 264 bytes [ 66.996046] DOUG: Allocating 264 bytes [ 66.999907] DOUG: Allocating 4 bytes [ 67.003590] DOUG: Allocating 279584 bytes [ 67.007773] DOUG: Allocating 8768 bytes [ 67.011732] DOUG: Allocating 65552 bytes [ 67.015789] DOUG: Allocating 64 bytes [ 67.019576] DOUG: Allocating 16 bytes [ 67.023366] DOUG: Allocating 8 bytes [ 67.027059] DOUG: Allocating 8 bytes [ 67.030753] DOUG: Allocating 528 bytes [ 67.034620] DOUG: Allocating 16 bytes [ 67.038402] DOUG: Allocating 264 bytes [ 67.042266] DOUG: Allocating 264 bytes [ 67.046144] DOUG: Allocating 4 bytes [ 67.049827] DOUG: Allocating 279584 bytes [ 67.054026] DOUG: Allocating 8768 bytes [ 67.057990] DOUG: Allocating 65552 bytes [ 67.062050] DOUG: Allocating 64 bytes [ 67.065826] DOUG: Allocating 16 bytes [ 67.069603] DOUG: Allocating 8 bytes [ 67.073285] DOUG: Allocating 8 bytes [ 67.076977] DOUG: Allocating 528 bytes [ 67.080836] DOUG: Allocating 16 bytes [ 67.084605] DOUG: Allocating 264 bytes [ 67.088461] DOUG: Allocating 264 bytes [ 67.092328] DOUG: Allocating 4 bytes [ 67.096015] DOUG: Allocating 279584 bytes [ 67.100214] DOUG: Allocating 8768 bytes [ 67.104182] DOUG: Allocating 65552 bytes [ 67.108245] DOUG: Allocating 64 bytes [ 67.112028] DOUG: Allocating 16 bytes [ 67.115804] DOUG: Allocating 8 bytes [ 67.119487] DOUG: Allocating 8 bytes [ 67.123168] DOUG: Allocating 528 bytes [ 67.127027] DOUG: Allocating 16 bytes [ 67.130806] DOUG: Allocating 264 bytes [ 67.134662] DOUG: Allocating 264 bytes [ 67.138527] DOUG: Allocating 4 bytes [ 67.142213] DOUG: Allocating 279584 bytes [ 67.146402] DOUG: Allocating 8768 bytes [ 67.150378] DOUG: Allocating 65552 bytes [ 67.154434] DOUG: Allocating 64 bytes [ 67.158209] DOUG: Allocating 16 bytes [ 67.161980] DOUG: Allocating 8 bytes [ 67.165665] DOUG: Allocating 8 bytes [ 67.169355] DOUG: Allocating 528 bytes [ 67.173219] DOUG: Allocating 16 bytes [ 67.176989] DOUG: Allocating 264 bytes [ 67.180847] DOUG: Allocating 264 bytes [ 67.184710] DOUG: Allocating 4 bytes [ 67.188385] DOUG: Allocating 279584 bytes [ 67.192569] DOUG: Allocating 8768 bytes [ 67.196522] DOUG: Allocating 65552 bytes [ 67.200570] DOUG: Allocating 64 bytes [ 67.204340] DOUG: Allocating 16 bytes [ 67.208109] DOUG: Allocating 8 bytes [ 67.211788] DOUG: Allocating 8 bytes [ 67.215468] DOUG: Allocating 528 bytes [ 67.219332] DOUG: Allocating 16 bytes [ 67.223108] DOUG: Allocating 264 bytes [ 67.226968] DOUG: Allocating 264 bytes [ 67.230834] DOUG: Allocating 4 bytes [ 67.234510] DOUG: Allocating 279584 bytes [ 67.238697] DOUG: Allocating 8768 bytes [ 67.242660] DOUG: Allocating 65552 bytes [ 67.246716] DOUG: Allocating 64 bytes [ 67.250487] DOUG: Allocating 16 bytes [ 67.254261] DOUG: Allocating 8 bytes [ 67.257955] DOUG: Allocating 8 bytes [ 67.261640] DOUG: Allocating 528 bytes [ 67.265497] DOUG: Allocating 16 bytes [ 67.269267] DOUG: Allocating 264 bytes [ 67.273131] DOUG: Allocating 264 bytes [ 67.277026] DOUG: Allocating 4 bytes [ 67.280721] DOUG: Allocating 279584 bytes [ 67.284914] DOUG: Allocating 8768 bytes [ 67.288868] DOUG: Allocating 65552 bytes [ 67.292927] DOUG: Allocating 64 bytes [ 67.296699] DOUG: Allocating 16 bytes [ 67.300479] DOUG: Allocating 8 bytes [ 67.304158] DOUG: Allocating 8 bytes [ 67.307848] DOUG: Allocating 528 bytes [ 67.311702] DOUG: Allocating 16 bytes [ 67.315469] DOUG: Allocating 264 bytes [ 67.319331] DOUG: Allocating 264 bytes [ 67.323196] DOUG: Allocating 4 bytes [ 67.326879] DOUG: Allocating 279584 bytes [ 67.331067] DOUG: Allocating 8768 bytes [ 67.335033] DOUG: Allocating 65552 bytes [ 67.339089] DOUG: Allocating 64 bytes [ 67.342866] DOUG: Allocating 16 bytes [ 67.346641] DOUG: Allocating 8 bytes [ 67.350323] DOUG: Allocating 8 bytes [ 67.354005] DOUG: Allocating 528 bytes [ 67.357869] DOUG: Allocating 16 bytes [ 67.361636] DOUG: Allocating 264 bytes [ 67.365492] DOUG: Allocating 264 bytes [ 67.369355] DOUG: Allocating 4 bytes [ 67.373040] DOUG: Allocating 279584 bytes [ 67.377218] DOUG: Allocating 8768 bytes [ 67.381179] DOUG: Allocating 65552 bytes [ 67.385228] DOUG: Allocating 64 bytes [ 67.389005] DOUG: Allocating 16 bytes [ 67.392784] DOUG: Allocating 8 bytes [ 67.396461] DOUG: Allocating 8 bytes [ 67.400150] DOUG: Allocating 528 bytes [ 67.404011] DOUG: Allocating 16 bytes [ 67.407792] DOUG: Allocating 264 bytes [ 67.411649] DOUG: Allocating 264 bytes [ 67.415506] DOUG: Allocating 4 bytes [ 67.419184] DOUG: Allocating 279584 bytes [ 67.423364] DOUG: Allocating 8768 bytes [ 67.427320] DOUG: Allocating 65552 bytes [ 67.431367] DOUG: Allocating 64 bytes [ 67.435146] DOUG: Allocating 16 bytes [ 67.438923] DOUG: Allocating 8 bytes [ 67.442602] DOUG: Allocating 8 bytes [ 67.446286] DOUG: Allocating 528 bytes [ 67.450143] DOUG: Allocating 16 bytes [ 67.453913] DOUG: Allocating 264 bytes [ 67.457775] DOUG: Allocating 264 bytes [ 67.461637] DOUG: Allocating 4 bytes [ 67.465323] DOUG: Allocating 279584 bytes [ 67.469501] DOUG: Allocating 8768 bytes [ 67.473463] DOUG: Allocating 65552 bytes [ 67.477511] DOUG: Allocating 64 bytes [ 67.481283] DOUG: Allocating 16 bytes [ 67.485056] DOUG: Allocating 8 bytes [ 67.488735] DOUG: Allocating 8 bytes [ 67.492428] DOUG: Allocating 528 bytes [ 67.496298] DOUG: Allocating 16 bytes [ 67.500072] DOUG: Allocating 264 bytes [ 67.503932] DOUG: Allocating 264 bytes [ 67.507803] DOUG: Allocating 4 bytes [ 67.511484] DOUG: Allocating 279584 bytes [ 67.515667] DOUG: Allocating 8768 bytes [ 67.519624] DOUG: Allocating 65552 bytes [ 67.523679] DOUG: Allocating 64 bytes [ 67.527447] DOUG: Allocating 16 bytes [ 67.531222] DOUG: Allocating 8 bytes [ 67.534907] DOUG: Allocating 8 bytes [ 67.538593] DOUG: Allocating 528 bytes [ 67.542458] DOUG: Allocating 16 bytes [ 67.546225] DOUG: Allocating 264 bytes [ 67.550090] DOUG: Allocating 264 bytes [ 67.553956] DOUG: Allocating 4 bytes [ 67.557634] DOUG: Allocating 279584 bytes [ 67.561818] DOUG: Allocating 8768 bytes [ 67.565775] DOUG: Allocating 65552 bytes [ 67.569823] DOUG: Allocating 64 bytes [ 67.573602] DOUG: Allocating 16 bytes [ 67.577380] DOUG: Allocating 8 bytes [ 67.581060] DOUG: Allocating 8 bytes [ 67.584748] DOUG: Allocating 528 bytes [ 67.588607] DOUG: Allocating 16 bytes [ 67.592384] DOUG: Allocating 264 bytes [ 67.596240] DOUG: Allocating 264 bytes [ 67.600105] DOUG: Allocating 4 bytes [ 67.603786] DOUG: Allocating 279584 bytes [ 67.607968] DOUG: Allocating 8768 bytes [ 67.611927] DOUG: Allocating 65552 bytes [ 67.615979] DOUG: Allocating 64 bytes [ 67.619757] DOUG: Allocating 16 bytes [ 67.623529] DOUG: Allocating 8 bytes [ 67.627216] DOUG: Allocating 8 bytes The above printouts were taken on a sc7180-trogdor-lazor device running mainline (roughly "Linux localhost 6.8.0-rc2") booted w/ ChromeOS userspace. If you need me to dig more into how coredumps work then I can see if I can track down exactly what part of the coredump is causing it to need the big allocation. "chrome" is a bit of a beast of an application, though. I'd also note that chrome makes extensive use of address space randomization which uses up huge amounts of virtual address space, so a shot in the dark is that maybe that has something to do with it? Looking at the virtual address space of Chrome in "top" shows stuff like: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2012 chronos 12 -8 32.7g 230520 160504 S 1.0 2.9 0:12.49 chrome 6044 chronos 12 -8 32.5g 95204 61888 S 1.0 1.2 0:05.90 chrome 2191 chronos 12 -8 107.0g 72200 51264 S 0.0 0.9 0:00.08 chrome -Doug
Hi, On Thu, Feb 1, 2024 at 5:24 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Feb 01, 2024 at 05:12:03PM -0800, Douglas Anderson wrote: > > While browsing through ChromeOS crash reports, I found one with an > > allocation failure that looked like this: > > > > chrome: page allocation failure: order:7, > > mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), > > nodemask=(null),cpuset=urgent,mems_allowed=0 > > That does seem bad ... > > > @@ -16,14 +17,14 @@ static int __regset_get(struct task_struct *target, > > if (size > regset->n * regset->size) > > size = regset->n * regset->size; > > if (!p) { > > - to_free = p = kzalloc(size, GFP_KERNEL); > > + to_free = p = vmalloc(size); > > It's my impression that sometimes this size might be relatively small? > Perhaps we should make this kvmalloc so that we can satisfy it from the > slab allocator if it is small? Right. Sometimes it's small. It feels sad to me that somehow vmalloc() of small sizes would be much less efficient than kvmalloc() of small sizes, but I can change it to that if you want. It feels like we should use kmalloc() if we need it to be contiguous, kvmalloc() if we know that there will be big efficiency gains with things being contiguous but we can get by with non-contiguous, and vmalloc() if we just don't care. ;-) ...anyway, I'll spin v2 with kvmalloc(). > Also, I assume that we don't rely on this memory being physically > contiguous; we don't, for example, do I/O on it? As far as I can tell we don't. I had never looked at or thought about this code before today and so all I have is ~an hour of code analysis behind me, so if someone tells me I'm wrong then I'll believe them. -Doug
On Thu, Feb 01, 2024 at 06:54:51PM -0800, Doug Anderson wrote: > > What the hell? Which regset could have lead to that? > > It would need to have the total size of register in excess of > > 256K. Seriously, which regset is that about? Note that we > > have just made sure that size is not greater than that product. > > size is unsigned int, so it's not as if a negative value passed > > to function could get through that test only to be interpreted > > as large positive later... > > > > Details, please. > > I can continue to dig more, but it is easy for me to reproduce this. > On the stack is elf_core_dump() and it seems like we're getting a core > dump of the chrome process. So I just arbitrarily look for the chrome > GPU process: > > $ ps aux | grep gpu-process > chronos 2075 3.0 1.1 34075552 95372 ? S<l 18:44 0:01 > /opt/google/chrome/chrome --type=gpu-process ... > > Then I send it a quit: > > $ kill -quit 2075 > > I added some printouts for this allocation and there are a ton. Here's > all of them, some of which are over 256K: Well, the next step would be to see which regset it is - if you see that kind of allocation, print regset->n, regset->size and regset->core_note_type.
Hi, On Thu, Feb 1, 2024 at 7:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Thu, Feb 01, 2024 at 06:54:51PM -0800, Doug Anderson wrote: > > > What the hell? Which regset could have lead to that? > > > It would need to have the total size of register in excess of > > > 256K. Seriously, which regset is that about? Note that we > > > have just made sure that size is not greater than that product. > > > size is unsigned int, so it's not as if a negative value passed > > > to function could get through that test only to be interpreted > > > as large positive later... > > > > > > Details, please. > > > > I can continue to dig more, but it is easy for me to reproduce this. > > On the stack is elf_core_dump() and it seems like we're getting a core > > dump of the chrome process. So I just arbitrarily look for the chrome > > GPU process: > > > > $ ps aux | grep gpu-process > > chronos 2075 3.0 1.1 34075552 95372 ? S<l 18:44 0:01 > > /opt/google/chrome/chrome --type=gpu-process ... > > > > Then I send it a quit: > > > > $ kill -quit 2075 > > > > I added some printouts for this allocation and there are a ton. Here's > > all of them, some of which are over 256K: > > Well, the next step would be to see which regset it is - if you > see that kind of allocation, print regset->n, regset->size and > regset->core_note_type. Of course! Here are the big ones: [ 45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16, core_note_type=1029 [ 45.884809] DOUG: Allocating 8768 bytes, n=548, size=16, core_note_type=1035 [ 45.893958] DOUG: Allocating 65552 bytes, n=4097, size=16, core_note_type=1036 -Doug
On Thu, Feb 01, 2024 at 07:15:48PM -0800, Doug Anderson wrote: > > > > Well, the next step would be to see which regset it is - if you > > see that kind of allocation, print regset->n, regset->size and > > regset->core_note_type. > > Of course! Here are the big ones: > > [ 45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16, > core_note_type=1029 0x405, NT_ARM_SVE [REGSET_SVE] = { /* Scalable Vector Extension */ .core_note_type = NT_ARM_SVE, .n = DIV_ROUND_UP(SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE), SVE_VQ_BYTES), .size = SVE_VQ_BYTES, IDGI. Wasn't SVE up to 32 * 2Kbit, i.e. 8Kbyte max? Any ARM folks around? Sure, I understand that it's variable-sized and we want to allocate enough for the worst case, but can we really get about 280Kb there? Context switches would be really unpleasant on such boxen... > [ 45.884809] DOUG: Allocating 8768 bytes, n=548, size=16, core_note_type=1035 > [ 45.893958] DOUG: Allocating 65552 bytes, n=4097, size=16, > core_note_type=1036 0x40c, NT_ARM_ZA. /* * ZA is a single register but it's variably sized and * the ptrace core requires that the size of any data * be an exact multiple of the configured register * size so report as though we had SVE_VQ_BYTES * registers. These values aren't exposed to * userspace. */ .n = DIV_ROUND_UP(ZA_PT_SIZE(SME_VQ_MAX), SVE_VQ_BYTES), .size = SVE_VQ_BYTES,
On Fri, Feb 02, 2024 at 03:49:25AM +0000, Al Viro wrote: > On Thu, Feb 01, 2024 at 07:15:48PM -0800, Doug Anderson wrote: > > > > > > Well, the next step would be to see which regset it is - if you > > > see that kind of allocation, print regset->n, regset->size and > > > regset->core_note_type. > > > > Of course! Here are the big ones: > > > > [ 45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16, > > core_note_type=1029 > > 0x405, NT_ARM_SVE > [REGSET_SVE] = { /* Scalable Vector Extension */ > .core_note_type = NT_ARM_SVE, > .n = DIV_ROUND_UP(SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE), > SVE_VQ_BYTES), > .size = SVE_VQ_BYTES, > > IDGI. Wasn't SVE up to 32 * 2Kbit, i.e. 8Kbyte max? Any ARM folks around? > Sure, I understand that it's variable-sized and we want to allocate enough > for the worst case, but can we really get about 280Kb there? Context switches > would be really unpleasant on such boxen... FWIW, this apparently intends to be "variable, up to SVE_PT_SIZE(...) bytes"; no idea if SVE_PT_SIZE is the right thing to use here.
Hi, On Thu, Feb 1, 2024 at 8:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Fri, Feb 02, 2024 at 03:49:25AM +0000, Al Viro wrote: > > On Thu, Feb 01, 2024 at 07:15:48PM -0800, Doug Anderson wrote: > > > > > > > > Well, the next step would be to see which regset it is - if you > > > > see that kind of allocation, print regset->n, regset->size and > > > > regset->core_note_type. > > > > > > Of course! Here are the big ones: > > > > > > [ 45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16, > > > core_note_type=1029 > > > > 0x405, NT_ARM_SVE > > [REGSET_SVE] = { /* Scalable Vector Extension */ > > .core_note_type = NT_ARM_SVE, > > .n = DIV_ROUND_UP(SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE), > > SVE_VQ_BYTES), > > .size = SVE_VQ_BYTES, > > > > IDGI. Wasn't SVE up to 32 * 2Kbit, i.e. 8Kbyte max? Any ARM folks around? > > Sure, I understand that it's variable-sized and we want to allocate enough > > for the worst case, but can we really get about 280Kb there? Context switches > > would be really unpleasant on such boxen... > > FWIW, this apparently intends to be "variable, up to SVE_PT_SIZE(...) bytes"; > no idea if SVE_PT_SIZE is the right thing to use here. +folks from `./scripts/get_maintainer.pl -f arch/arm64/kernel/ptrace.c` Trying to follow the macros to see where "n" comes from is a maze of twisty little passages, all alike. Hopefully someone from the ARM world can help tell if the value of 17474 for n here is correct or if something is wonky.
On Fri, Feb 02, 2024 at 08:24:17AM -0800, Doug Anderson wrote: > Hi, > > On Thu, Feb 1, 2024 at 8:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > On Fri, Feb 02, 2024 at 03:49:25AM +0000, Al Viro wrote: > > > On Thu, Feb 01, 2024 at 07:15:48PM -0800, Doug Anderson wrote: > > > > > > > > > > Well, the next step would be to see which regset it is - if you > > > > > see that kind of allocation, print regset->n, regset->size and > > > > > regset->core_note_type. > > > > > > > > Of course! Here are the big ones: > > > > > > > > [ 45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16, > > > > core_note_type=1029 > > > > > > 0x405, NT_ARM_SVE > > > [REGSET_SVE] = { /* Scalable Vector Extension */ > > > .core_note_type = NT_ARM_SVE, > > > .n = DIV_ROUND_UP(SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE), > > > SVE_VQ_BYTES), > > > .size = SVE_VQ_BYTES, > > > > > > IDGI. Wasn't SVE up to 32 * 2Kbit, i.e. 8Kbyte max? Any ARM folks around? > > > Sure, I understand that it's variable-sized and we want to allocate enough > > > for the worst case, but can we really get about 280Kb there? Context switches > > > would be really unpleasant on such boxen... > > > > FWIW, this apparently intends to be "variable, up to SVE_PT_SIZE(...) bytes"; > > no idea if SVE_PT_SIZE is the right thing to use here. > > +folks from `./scripts/get_maintainer.pl -f arch/arm64/kernel/ptrace.c` > > Trying to follow the macros to see where "n" comes from is a maze of > twisty little passages, all alike. Hopefully someone from the ARM > world can help tell if the value of 17474 for n here is correct or if > something is wonky. It might be interesting to have it print the return value of __regset_get() in those cases; if *that* is huge, we really have a problem. If it ends up small enough to fit into few pages, OTOH... SVE_VQ_MAX is defined as 255; is that really in units of 128 bits? IOW, do we really expect to support 32Kbit registers? That would drive the size into that range, all right, but it would really suck on context switches. I could be misreading it, though - the macros in there are not easy to follow and I've never dealt with SVE before, so take the above with a cartload of salt.
On Fri, Feb 02, 2024 at 04:49:47PM +0000, Al Viro wrote: > > +folks from `./scripts/get_maintainer.pl -f arch/arm64/kernel/ptrace.c` > > > > Trying to follow the macros to see where "n" comes from is a maze of > > twisty little passages, all alike. Hopefully someone from the ARM > > world can help tell if the value of 17474 for n here is correct or if > > something is wonky. > > It might be interesting to have it print the return value of __regset_get() > in those cases; if *that* is huge, we really have a problem. If it ends up > small enough to fit into few pages, OTOH... > > SVE_VQ_MAX is defined as 255; is that really in units of 128 bits? IOW, > do we really expect to support 32Kbit registers? That would drive the > size into that range, all right, but it would really suck on context > switches. > > I could be misreading it, though - the macros in there are not easy to > follow and I've never dealt with SVE before, so take the above with > a cartload of salt. Worse - it's SVE_VQ_MAX is 512; sorry about the confusion. OK, that would certainly explain the size (header + 32 registers, each up to 512 * 16 bytes), but... ouch.
On Fri, Feb 02, 2024 at 03:49:25AM +0000, Al Viro wrote: > On Thu, Feb 01, 2024 at 07:15:48PM -0800, Doug Anderson wrote: > > [ 45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16, > > core_note_type=1029 > 0x405, NT_ARM_SVE > [REGSET_SVE] = { /* Scalable Vector Extension */ > .core_note_type = NT_ARM_SVE, > .n = DIV_ROUND_UP(SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE), > SVE_VQ_BYTES), > .size = SVE_VQ_BYTES, > IDGI. Wasn't SVE up to 32 * 2Kbit, i.e. 8Kbyte max? Any ARM folks around? > Sure, I understand that it's variable-sized and we want to allocate enough > for the worst case, but can we really get about 280Kb there? Context switches > would be really unpleasant on such boxen... The architecture itself is limited to 2048 bit vector lengths, and practical implementations have thus far not exceeded 512 bits with the overwhelming majority of systems being 128 bit. 2048 is commonly seen in emulation though. As well as the 32 Z registers we have 16 P registers of VQ*2 bytes plus one more register FFR the same size as the P registers and a header describing the VL and specific format of the data, all in this regset. The Linux ABI defines the maximum vector length much larger than the architecture allows and that define does flow into the kernel code, I believe this was based on consideration of bits 8:4 of ZCR_ELx[1] which look like they're earmarked for potential future expansion should 2048 bits ever prove to be insufficient. We should really do something like what we did for SME and define down what ptrace uses to the actual architectural maximum since no system will ever see any more than that, that'd still result in large allocations but less impressively and wastefully so. I'll go and look at doing a patch for that just now. Unfortunately SVE_VQ_MAX is in the uapi headers, we've already stopped using it in the test programs due to the overallocation. [1] https://developer.arm.com/documentation/ddi0601/2023-12/AArch64-Registers/ZCR-EL1--SVE-Control-Register--EL1-?lang=en > > [ 45.884809] DOUG: Allocating 8768 bytes, n=548, size=16, core_note_type=1035 > > [ 45.893958] DOUG: Allocating 65552 bytes, n=4097, size=16, > > core_note_type=1036 > 0x40c, NT_ARM_ZA. > /* > * ZA is a single register but it's variably sized and > * the ptrace core requires that the size of any data > * be an exact multiple of the configured register > * size so report as though we had SVE_VQ_BYTES > * registers. These values aren't exposed to > * userspace. > */ > .n = DIV_ROUND_UP(ZA_PT_SIZE(SME_VQ_MAX), SVE_VQ_BYTES), > .size = SVE_VQ_BYTES, Yup, and SME_VQ_MAX is defined to the actual architectural maximum of 2048 largely due to issues with the size of the allocation for ptrace. There are not yet any physical implementations of SME so I can't comment on the actual vector lengths we'll observe in the immediate future. I see there's a comment update needed there for s/SVE/SME/ too.
On Fri, Feb 02, 2024 at 04:55:24PM +0000, Al Viro wrote: > On Fri, Feb 02, 2024 at 04:49:47PM +0000, Al Viro wrote: > > > +folks from `./scripts/get_maintainer.pl -f arch/arm64/kernel/ptrace.c` > > > > > > Trying to follow the macros to see where "n" comes from is a maze of > > > twisty little passages, all alike. Hopefully someone from the ARM > > > world can help tell if the value of 17474 for n here is correct or if > > > something is wonky. Nope, that's the "correct" answer... > > > > It might be interesting to have it print the return value of __regset_get() > > in those cases; if *that* is huge, we really have a problem. If it ends up > > small enough to fit into few pages, OTOH... > > > > SVE_VQ_MAX is defined as 255; is that really in units of 128 bits? IOW, > > do we really expect to support 32Kbit registers? That would drive the > > size into that range, all right, but it would really suck on context > > switches. > > > > I could be misreading it, though - the macros in there are not easy to > > follow and I've never dealt with SVE before, so take the above with > > a cartload of salt. > > Worse - it's SVE_VQ_MAX is 512; sorry about the confusion. OK, that would > certainly explain the size (header + 32 registers, each up to 512 * 16 bytes), > but... ouch. Mark Brown [+ Cc] has been taking care of SVE in my absence, but from memory: The SVE architecture has a really big maximum vector size (16 * 128 = 2048 bits), and there is a theoretical possibility of it getting bigger in the future, though unlikely. Real platforms to date have a much smaller limit, though Qemu can go up to 2048 bits IIUC. My aim when working on the ABI was to future-proof it against foreseeable expansion on the architecture side, but this does mean that we cannot statically determine a sane limit for the vector size. I suppose we could have had a more sane limit built into the kernel or a Kconfig option for it, but it seemed simpler just to determine the size dynamically depending on the task's current state. This is not so important for coredumps, but for the the gdbstub wire protocol etc. it seemed undesirable to have the regset larger than needed. Hence the reason for adding ->get_size() in 27e64b4be4b8 ("regset: Add support for dynamically sized regsets"). What I guess was not so obvious from the commit message is the expected relationship between the actual and maximum possible size of the regset: for SVE the actual size is in practice going to be *much* smaller than the max, while the max is crazy large because of being an ABI design limit chosen for futureproofing purposes. So, if the only reason for trying to migrate to vmalloc() is to cope with an insanely sized regset on arm64, I think somehow or other we can avoid that. Options: a) bring back ->get_size() so that we can allocate the correct size before generating the regset data; b) make aarch64_regsets[] __ro_after_init and set aarch64_regsets[REGSET_SVE].n based on the boot-time probed maximum size (which will be sane); or c) allow membufs to grow if needed (sounds fragile though, and may be hard to justify just for one arch?). Thoughts? If people don't want to bring back get_size(), then (b) doesn't look too bad. Cheers ---Dave
Hi, On Fri, Feb 2, 2024 at 10:08 AM Dave Martin <Dave.Martin@arm.com> wrote: > > So, if the only reason for trying to migrate to vmalloc() is to cope > with an insanely sized regset on arm64, I think somehow or other we can > avoid that. Right. The only reason for the patch to switch to vmalloc() was in reaction to seeing the order 7 memory allocation. If we can decrease that to something sensible then I'm happy enough keeping the allocation as kmalloc(). > Options: > > a) bring back ->get_size() so that we can allocate the correct size > before generating the regset data; > > b) make aarch64_regsets[] __ro_after_init and set > aarch64_regsets[REGSET_SVE].n based on the boot-time probed maximum size > (which will be sane); or > > c) allow membufs to grow if needed (sounds fragile though, and may be > hard to justify just for one arch?). > > > Thoughts? > > If people don't want to bring back get_size(), then (b) doesn't look > too bad. Either a) or b) sounds fine to me, but I'm just a visitor to this code so maybe I'll let the adults in the room chime in with their opinions. ;-) Also: if you think it's fruitful for me to try to write a patch to do either of those then I can, but I also wouldn't object at all to someone else writing a patch to fix this and I can just provide a Tested-by and/or Reviewed-by. Let me know. -Doug
On Fri, Feb 02, 2024 at 06:07:54PM +0000, Dave Martin wrote: > So, if the only reason for trying to migrate to vmalloc() is to cope > with an insanely sized regset on arm64, I think somehow or other we can > avoid that. With SME we do routinely see the full glory of the 64K regset for ZA in emulated systems so I think we have to treat it as an issue. > Options: > a) bring back ->get_size() so that we can allocate the correct size > before generating the regset data; > b) make aarch64_regsets[] __ro_after_init and set > aarch64_regsets[REGSET_SVE].n based on the boot-time probed maximum size > (which will be sane); or Either of those seems sensible to me, a function would minimise the size of allocations based on the process configuration which would be nice and given that we're doing allocations it's probably reasonable overhead. > c) allow membufs to grow if needed (sounds fragile though, and may be > hard to justify just for one arch?). I'm having a hard time getting enthusiastic about that one for the reasons you mention. We can also just lower the maximum size we tell the ptrace core to the actual architectural maximum since AFAICT we don't expose that anywhere external, I've got a patch in CI for that. We'd still be allocating more memory than we need for practical systems but less extravagantly so. It seems more suitable for an immediate fix for people to pick up for production. It did occur to me at some point in the past that we should avoid telling the core about regsets that aren't physically supported on the current system, I didn't get round to looking at that yet.
Hi, On Fri, Feb 2, 2024 at 11:43 AM Mark Brown <broonie@kernel.org> wrote: > > On Fri, Feb 02, 2024 at 06:07:54PM +0000, Dave Martin wrote: > > > So, if the only reason for trying to migrate to vmalloc() is to cope > > with an insanely sized regset on arm64, I think somehow or other we can > > avoid that. > > With SME we do routinely see the full glory of the 64K regset for ZA in > emulated systems so I think we have to treat it as an issue. Ah, got it. 64K is much less likely to be as big of a problem (only an order 4 allocation), but you're right that it's still a size where kvmalloc() would be an improvement. With that in mind I'll plan to send out a v2 of my patch where I use kvmalloc() instead of vmalloc() and update the commit description a bit, including a link to this thread. Then I will assume that others on this thread will move forward with actually making the allocations smaller. Please yell if the above sounds wrong. :-) -Doug
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 5397b552fbeb..ac178ad38823 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1928,7 +1928,7 @@ static void free_note_info(struct elf_note_info *info) threads = t->next; WARN_ON(t->notes[0].data && t->notes[0].data != &t->prstatus); for (i = 1; i < info->thread_notes; ++i) - kfree(t->notes[i].data); + kvfree(t->notes[i].data); kfree(t); } kfree(info->psinfo.data); diff --git a/kernel/regset.c b/kernel/regset.c index 586823786f39..ed8d8cf3a22c 100644 --- a/kernel/regset.c +++ b/kernel/regset.c @@ -2,6 +2,7 @@ #include <linux/export.h> #include <linux/slab.h> #include <linux/regset.h> +#include <linux/vmalloc.h> static int __regset_get(struct task_struct *target, const struct user_regset *regset, @@ -16,14 +17,14 @@ static int __regset_get(struct task_struct *target, if (size > regset->n * regset->size) size = regset->n * regset->size; if (!p) { - to_free = p = kzalloc(size, GFP_KERNEL); + to_free = p = vmalloc(size); if (!p) return -ENOMEM; } res = regset->regset_get(target, regset, (struct membuf){.p = p, .left = size}); if (res < 0) { - kfree(to_free); + vfree(to_free); return res; } *data = p; @@ -71,6 +72,6 @@ int copy_regset_to_user(struct task_struct *target, ret = regset_get_alloc(target, regset, size, &buf); if (ret > 0) ret = copy_to_user(data, buf, ret) ? -EFAULT : 0; - kfree(buf); + vfree(buf); return ret; }
While browsing through ChromeOS crash reports, I found one with an allocation failure that looked like this: chrome: page allocation failure: order:7, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=urgent,mems_allowed=0 CPU: 7 PID: 3295 Comm: chrome Not tainted 5.15.133-20574-g8044615ac35c #1 (HASH:1162 1) Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT) Call trace: dump_backtrace+0x0/0x1ec show_stack+0x20/0x2c dump_stack_lvl+0x60/0x78 dump_stack+0x18/0x38 warn_alloc+0x104/0x174 __alloc_pages+0x5f0/0x6e4 kmalloc_order+0x44/0x98 kmalloc_order_trace+0x34/0x124 __kmalloc+0x228/0x36c __regset_get+0x68/0xcc regset_get_alloc+0x1c/0x28 elf_core_dump+0x3d8/0xd8c do_coredump+0xeb8/0x1378 get_signal+0x14c/0x804 ... An order 7 allocation is (1 << 7) contiguous pages, or 512K. It's not a surprise that this allocation failed on a system that's been running for a while. In this case we're just generating a core dump and there's no reason we need contiguous memory. Change the allocation to vmalloc(). We'll change the free in binfmt_elf to kvfree() which works regardless of how the memory was allocated. Signed-off-by: Douglas Anderson <dianders@chromium.org> --- I don't know this code at all, but I _think_ I've identified the places where the memory is freed correctly. Please double-check that I didn't miss anything obvious, though! fs/binfmt_elf.c | 2 +- kernel/regset.c | 7 ++++--- 2 files changed, 5 insertions(+), 4 deletions(-)