Message ID | CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | lockref scalability on x86-64 vs cpu_relax | expand |
[ Adding linux-arch, which is relevant but not very specific, and the arm64 and powerpc maintainers that are the more specific cases for an architecture where this might actually matter. See https://lore.kernel.org/all/CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com/ for original full email, but it might be sufficiently clear just from this heavily cut-down context too ] Side note on your access() changes - if it turns out that you can remove all the cred games, we should possibly then revert my old commit d7852fbd0f04 ("access: avoid the RCU grace period for the temporary subjective credentials") which avoided the biggest issue with the unnecessary cred switching. I *think* access() is the only user of that special 'non_rcu' thing, but it is possible that the whole 'non_rcu' thing ends up mattering for cases where the cred actually does change because euid != uid (ie suid programs), so this would need a bit more effort to do performance testing on. On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > To my understanding on said architecture failed cmpxchg still grants you > exclusive access to the cacheline, making immediate retry preferable > when trying to inc/dec unless a certain value is found. I actually suspect that is _always_ the case - this is not like a contended spinlock where we want to pause because we're waiting for the value to change and become unlocked, this cmpxchg loop is likely always better off just retrying with the new value. That said, the "likely always better off" is purely about performance. So I have this suspicion that the reason Tony added the cpu_relax() was simply not about performance, but about other issues, like fairness in SMT situations. That said, evern from a fairness perspective the cpu_relax() sounds a bit odd and unlikely - we're literally yielding when we lost a race, so it hurts the _loser_, not the winner, and thus might make fairness worse too. I dunno. Tony may have some memory of what the issue was. > ... without numbers attached to it. Given the above linked thread it > looks like the arch this was targeting was itanium, not x86-64, but > the change landed for everyone. Yeah, if it was ia64-only, it's a non-issue these days. It's dead and in pure maintenance mode from a kernel perspective (if even that). > Later it was further augmented with: > commit 893a7d32e8e04ca4d6c882336b26ed660ca0a48d > Author: Jan Glauber <jan.glauber@gmail.com> > Date: Wed Jun 5 15:48:49 2019 +0200 > > lockref: Limit number of cmpxchg loop retries > [snip] > With the retry limit the performance of an open-close testcase > improved between 60-70% on ThunderX2. > > While the benchmark was specifically on ThunderX2, the change once more > was made for all archs. Actually, in that case I did ask for the test to be run on x86 hardware too, and exactly like you found: > I should note in my tests the retry limit was never reached fwiw. the max loop retry number just isn't an issue. It fundamentally only affects extremely unfair platforms, so it's arguably always the right thing to do. So it may be "ThunderX2 specific" in that that is where it was noticed, but I think we can safely just consider the max loop thing to be a generic safety net that hopefully simply never triggers in practice on any sane platform. > All that said, I think the thing to do here is to replace cpu_relax > with a dedicated arch-dependent macro, akin to the following: I would actually prefer just removing it entirely and see if somebody else hollers. You have the numbers to prove it hurts on real hardware, and I don't think we have any numbers to the contrary. So I think it's better to trust the numbers and remove it as a failure, than say "let's just remove it on x86-64 and leave everybody else with the potentially broken code" Because I do think that a cmpxchg loop that updates the value it compares and exchanges is fundamentally different from a "busy-loop, trying to read while locked", and with your numbers as ammunition, I think it's better to just remove that cpu_relax() entirely. Then other architectures can try to run their numbers, and only *if* it then turns out that they have a reason to do something else should we make this conditional and different on different architectures. Let's try to keep the code as common as possibly until we have hard evidence for special cases, in other words. Linus
> Yeah, if it was ia64-only, it's a non-issue these days. It's dead and > in pure maintenance mode from a kernel perspective (if even that). There's not much "simultaneous" in the SMT on ia64. One thread in a spin loop will hog the core until the h/w switches to the other thread some number of cycles (hundreds, thousands? I really can remember). So I was pretty generous with dropping cpu_relax() into any kind of spin loop. Is it time yet for: $ git rm -r arch/ia64 -Tony
On Thu, Jan 12, 2023 at 6:31 PM Luck, Tony <tony.luck@intel.com> wrote: > > There's not much "simultaneous" in the SMT on ia64. Oh, I forgot about the whole SoEMT fiasco. Yeah, that might make ia64 act a bit differently here. But I don't think anybody cares any more, so I don't think that merits making this a per-architecture choice. The s390 people hated cpu_relax() here, but for them it was really because it was bad *everywhere*, and they just made it a no-op (see commit 22b6430d3665 "locking/core, s390: Make cpu_relax() a barrier again"). There had been a (failed) attempt at "cpu_relax_lowlatency()" for the s390 issues. Linus
On 1/13/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > Side note on your access() changes - if it turns out that you can > remove all the cred games, we should possibly then revert my old > commit d7852fbd0f04 ("access: avoid the RCU grace period for the > temporary subjective credentials") which avoided the biggest issue > with the unnecessary cred switching. > > I *think* access() is the only user of that special 'non_rcu' thing, > but it is possible that the whole 'non_rcu' thing ends up mattering > for cases where the cred actually does change because euid != uid (ie > suid programs), so this would need a bit more effort to do performance > testing on. > I don't think the games are avoidable. For one I found non-root processes with non-empty cap_effective even on my laptop, albeit I did not check how often something like this is doing access(). Discussion for another time. > On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@gmail.com> wrote: >> All that said, I think the thing to do here is to replace cpu_relax >> with a dedicated arch-dependent macro, akin to the following: > > I would actually prefer just removing it entirely and see if somebody > else hollers. You have the numbers to prove it hurts on real hardware, > and I don't think we have any numbers to the contrary. > > So I think it's better to trust the numbers and remove it as a > failure, than say "let's just remove it on x86-64 and leave everybody > else with the potentially broken code" > [snip] > Then other architectures can try to run their numbers, and only *if* > it then turns out that they have a reason to do something else should > we make this conditional and different on different architectures. > > Let's try to keep the code as common as possibly until we have hard > evidence for special cases, in other words. > I did not want to make such a change without redoing the ThunderX2 benchmark, or at least something else arm64-y. I may be able to bench it tomorrow on whatever arm-y stuff can be found on Amazon's EC2, assuming no arm64 people show up with their results. Even then IMHO the safest route is to patch it out on x86-64 and give other people time to bench their archs as they get around to it, and ultimately whack the thing if it turns out nobody benefits from it. I would say beats backpedaling on the removal, but I'm not going to fight for it. That said, does waiting for arm64 numbers and/or producing them for the removal commit message sound like a plan? If so, I'll post soon(tm).
On Fri Jan 13, 2023 at 10:13 AM AEST, Linus Torvalds wrote: > [ Adding linux-arch, which is relevant but not very specific, and the > arm64 and powerpc maintainers that are the more specific cases for an > architecture where this might actually matter. > > See > > https://lore.kernel.org/all/CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com/ > > for original full email, but it might be sufficiently clear just > from this heavily cut-down context too ] > > Side note on your access() changes - if it turns out that you can > remove all the cred games, we should possibly then revert my old > commit d7852fbd0f04 ("access: avoid the RCU grace period for the > temporary subjective credentials") which avoided the biggest issue > with the unnecessary cred switching. > > I *think* access() is the only user of that special 'non_rcu' thing, > but it is possible that the whole 'non_rcu' thing ends up mattering > for cases where the cred actually does change because euid != uid (ie > suid programs), so this would need a bit more effort to do performance > testing on. > > On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > To my understanding on said architecture failed cmpxchg still grants you > > exclusive access to the cacheline, making immediate retry preferable > > when trying to inc/dec unless a certain value is found. > > I actually suspect that is _always_ the case - this is not like a > contended spinlock where we want to pause because we're waiting for > the value to change and become unlocked, this cmpxchg loop is likely > always better off just retrying with the new value. Yes this should be true for powerpc (POWER CPUs, at least). > That said, the "likely always better off" is purely about performance. > > So I have this suspicion that the reason Tony added the cpu_relax() > was simply not about performance, but about other issues, like > fairness in SMT situations. > > That said, evern from a fairness perspective the cpu_relax() sounds a > bit odd and unlikely - we're literally yielding when we lost a race, > so it hurts the _loser_, not the winner, and thus might make fairness > worse too. Worse is that we've also actually just *won* a race when the cmpxchg comes back, i.e., to get the line exclusive in our cache line. Then we'll just sit there waiting and probably holding off other snoopers for a while. I don't see much of a fairness concern really. If there's a lot of contention here we'll be stalled for a long time waiting on the line, so SMT heuristics had better send resources to other threads making better progress anyway as it should for any cache miss situation. So this loop shouldn't be hogging up a lot of resources from the other thread(s). > I dunno. Tony may have some memory of what the issue was. > > > ... without numbers attached to it. Given the above linked thread it > > looks like the arch this was targeting was itanium, not x86-64, but > > the change landed for everyone. > > Yeah, if it was ia64-only, it's a non-issue these days. It's dead and > in pure maintenance mode from a kernel perspective (if even that). > > > Later it was further augmented with: > > commit 893a7d32e8e04ca4d6c882336b26ed660ca0a48d > > Author: Jan Glauber <jan.glauber@gmail.com> > > Date: Wed Jun 5 15:48:49 2019 +0200 > > > > lockref: Limit number of cmpxchg loop retries > > [snip] > > With the retry limit the performance of an open-close testcase > > improved between 60-70% on ThunderX2. > > > > While the benchmark was specifically on ThunderX2, the change once more > > was made for all archs. > > Actually, in that case I did ask for the test to be run on x86 > hardware too, and exactly like you found: > > > I should note in my tests the retry limit was never reached fwiw. > > the max loop retry number just isn't an issue. It fundamentally only > affects extremely unfair platforms, so it's arguably always the right > thing to do. > > So it may be "ThunderX2 specific" in that that is where it was > noticed, but I think we can safely just consider the max loop thing to > be a generic safety net that hopefully simply never triggers in > practice on any sane platform. > If there are a lot of threads contending I'm sure x86, POWER, probably most CPUs could quite possibly starve for hundreds if not thousands or more of iterations here. And I'm not really a fan of scattering random implementation specific crutches ad hoc throughout our primitives. At least it could be specific to the arch where it matters. Interesting that improves performance so much though. I wonder why? Hitting the limit will take the lock and that will cause all other CPUs to drop out of the "fast" path so it will degenerate to a spinlock. queued spinlock is pretty scalable but it really shouldn't be more scalable than an atomic OP. I bet this cpu_relax isn't helping, and probably ll/sc implementation of cmpxchg primitive doesn't help either. I reckon if you removed the cpu_relax there, big x86 systems actually wouldn't want that limit because the occasional case where you might hit the limit will switch behaviour to locking and you might not recover back to atomics if there is continued pressure on it. > > All that said, I think the thing to do here is to replace cpu_relax > > with a dedicated arch-dependent macro, akin to the following: > > I would actually prefer just removing it entirely and see if somebody > else hollers. You have the numbers to prove it hurts on real hardware, > and I don't think we have any numbers to the contrary. > > So I think it's better to trust the numbers and remove it as a > failure, than say "let's just remove it on x86-64 and leave everybody > else with the potentially broken code" Agree. powerpc almost certainly doesn't want it, we would indeed hit the issue identified which is that we'd get the line exclusive then go to sleep for a while holding off other snoopers for a bit then probably losing the line. It would be nice if ThunderX2 case could be re-tested without that retry count and without the cpu_relax too. > Because I do think that a cmpxchg loop that updates the value it > compares and exchanges is fundamentally different from a "busy-loop, > trying to read while locked", It is. powerpc (and probably other ll/sc archs) usually doesn't like this style of cmpxchg loop. Firstly, an initial load could bring the line in shared at first, then it goes exclusive with the larx so 2x the transactions, but actually worse because if it's shared it can let other CPUs go shared as well, then the exclusive has to shoot those down, etc. And if the value changes between those two loads, we fail it and try again whereas we could have done the update inside the reserve. The core isn't too happy about seeing a larx when there is already a larx outstanding because it's worried that it has gone out of order and so it doesn't want to let the second larx stamp on the first one potentially before the stcx for the first one has executed. So it can hold the larx until its the oldest instruction waiting for a stcx that never comes. It's possible our cmpxchg primitives would actually do better with a dummy stcx in the failure case to resolve that, but it's not necessarily better. We don't have a light-weight cancel-reservation instruction which is what we'd really want. The joys of ll/sc. Actually what we'd really want is an arch specific implementation of lockref. Thanks, Nick
On Thu, Jan 12, 2023 at 7:12 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > I did not want to make such a change without redoing the ThunderX2 > benchmark, or at least something else arm64-y. I may be able to bench it > tomorrow on whatever arm-y stuff can be found on Amazon's EC2, assuming > no arm64 people show up with their results. I don't think ThunderX2 itself is particularly interesting, but sure, it would be good to have numbers for some modern arm64 cores. The newer Amazon EC2 cores (Graviton 2/3) sound more relevant (or Ampere?) The more different architecture numbers we'd have for that "remove cpu_relax()", the better. Linus
On Thu, Jan 12, 2023 at 9:20 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > Actually what we'd really want is an arch specific implementation of > lockref. The problem is mainly that then you need to generate the asm versions of all those different CMPXCHG_LOOP() variants. They are all fairly simple, though, and it woudln't be hard to make the current lib/lockref.c just be the generic fallback if you don't have an arch-specific one. And even if you do have the arch-specific LL/SC version, you'd still want the generic fallback for the case where a spinlock isn't a single word any more (which happens when the spinlock debugging options are on). Linus
On Fri Jan 13, 2023 at 2:15 PM AEST, Linus Torvalds wrote: > On Thu, Jan 12, 2023 at 9:20 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > > > Actually what we'd really want is an arch specific implementation of > > lockref. > > The problem is mainly that then you need to generate the asm versions > of all those different CMPXCHG_LOOP() variants. > > They are all fairly simple, though, and it woudln't be hard to make > the current lib/lockref.c just be the generic fallback if you don't > have an arch-specific one. Yeah, it doesn't look too onerous so it's probably worth seeing what the code and some numbers look like here. > And even if you do have the arch-specific LL/SC version, you'd still > want the generic fallback for the case where a spinlock isn't a single > word any more (which happens when the spinlock debugging options are > on). You're right, good point. Thanks, Nick
On Fri, 13 Jan 2023 at 01:31, Luck, Tony <tony.luck@intel.com> wrote: > > > Yeah, if it was ia64-only, it's a non-issue these days. It's dead and > > in pure maintenance mode from a kernel perspective (if even that). > > There's not much "simultaneous" in the SMT on ia64. One thread in a > spin loop will hog the core until the h/w switches to the other thread some > number of cycles (hundreds, thousands? I really can remember). So I > was pretty generous with dropping cpu_relax() into any kind of spin loop. > > Is it time yet for: > > $ git rm -r arch/ia64 > Hi Tony, Can I take that as an ack on [0]? The EFI subsystem has evolved substantially over the years, and there is really no way to do any IA64 testing beyond build testing, so from that perspective, dropping it entirely would be welcomed. Thanks, Ard. [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=remove-ia64
On Fri, Jan 13, 2023 at 02:12:50AM +0100, Mateusz Guzik wrote: > On 1/13/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Side note on your access() changes - if it turns out that you can > > remove all the cred games, we should possibly then revert my old > > commit d7852fbd0f04 ("access: avoid the RCU grace period for the > > temporary subjective credentials") which avoided the biggest issue > > with the unnecessary cred switching. > > > > I *think* access() is the only user of that special 'non_rcu' thing, > > but it is possible that the whole 'non_rcu' thing ends up mattering > > for cases where the cred actually does change because euid != uid (ie > > suid programs), so this would need a bit more effort to do performance > > testing on. > > > > I don't think the games are avoidable. For one I found non-root > processes with non-empty cap_effective even on my laptop, albeit I did > not check how often something like this is doing access(). > > Discussion for another time. > > > On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > >> All that said, I think the thing to do here is to replace cpu_relax > >> with a dedicated arch-dependent macro, akin to the following: > > > > I would actually prefer just removing it entirely and see if somebody > > else hollers. You have the numbers to prove it hurts on real hardware, > > and I don't think we have any numbers to the contrary. > > > > So I think it's better to trust the numbers and remove it as a > > failure, than say "let's just remove it on x86-64 and leave everybody > > else with the potentially broken code" > > > [snip] > > Then other architectures can try to run their numbers, and only *if* > > it then turns out that they have a reason to do something else should > > we make this conditional and different on different architectures. > > > > Let's try to keep the code as common as possibly until we have hard > > evidence for special cases, in other words. > > > > I did not want to make such a change without redoing the ThunderX2 > benchmark, or at least something else arm64-y. I may be able to bench it > tomorrow on whatever arm-y stuff can be found on Amazon's EC2, assuming > no arm64 people show up with their results. > > Even then IMHO the safest route is to patch it out on x86-64 and give > other people time to bench their archs as they get around to it, and > ultimately whack the thing if it turns out nobody benefits from it. > I would say beats backpedaling on the removal, but I'm not going to > fight for it. > > That said, does waiting for arm64 numbers and/or producing them for the > removal commit message sound like a plan? If so, I'll post soon(tm). Honestly, I wouldn't worry about us (arm64) here. I don't think any real hardware implements the YIELD instruction (i.e. it behaves as a NOP in practice). The only place I'm aware of where it _does_ something is in QEMU, which was actually the motivation behind having it in cpu_relax() to start with (see 1baa82f48030 ("arm64: Implement cpu_relax as yield")). So, from the arm64 side of the fence, I'm perfectly happy just removing the cpu_relax() calls from lockref. Will
On Thu, Jan 12, 2023 at 06:13:16PM -0600, Linus Torvalds wrote: > On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > To my understanding on said architecture failed cmpxchg still grants you > > exclusive access to the cacheline, making immediate retry preferable > > when trying to inc/dec unless a certain value is found. > > I actually suspect that is _always_ the case - this is not like a > contended spinlock where we want to pause because we're waiting for > the value to change and become unlocked, this cmpxchg loop is likely > always better off just retrying with the new value. > > That said, the "likely always better off" is purely about performance. > > So I have this suspicion that the reason Tony added the cpu_relax() > was simply not about performance, but about other issues, like > fairness in SMT situations. > > That said, evern from a fairness perspective the cpu_relax() sounds a > bit odd and unlikely - we're literally yielding when we lost a race, > so it hurts the _loser_, not the winner, and thus might make fairness > worse too. I've been writing cmpxchg loops that have strict termination conditions without cpu_relax() in them for a while now. For example, the x86 atomic_fetch_and() implementation looks like so: static __always_inline int arch_atomic_fetch_and(int i, atomic_t *v) { int val = arch_atomic_read(v); do { } while (!arch_atomic_try_cmpxchg(v, &val, val & i)); return val; } And I did that because of the exact same argument you had above, it needs to do the op anyway, waiting between failed attempts will only increase the chance it will fail again.
>> Is it time yet for: >> >> $ git rm -r arch/ia64 >> > Can I take that as an ack on [0]? The EFI subsystem has evolved > substantially over the years, and there is really no way to do any > IA64 testing beyond build testing, so from that perspective, dropping > it entirely would be welcomed. > > Thanks, > Ard. > > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=remove-ia64 Yes. EFI isn't the only issue. A bunch of folks[1] have spent time fixing ia64 for (in most cases) some tree-wide patches that they needed. Their time might have been more productively spent fixing things that actually matter in modern times. Acked-by: Tony Luck <tony.luck@intel.com> -Tony [1] git log --no-merges --since=2year -- arch/ia64 | grep Author: | sort | uniq -c | sort -rn 19 Author: Masahiro Yamada <masahiroy@kernel.org> 11 Author: Sergei Trofimovich <slyfox@gentoo.org> 9 Author: Eric W. Biederman <ebiederm@xmission.com> 8 Author: Arnd Bergmann <arnd@arndb.de> 6 Author: Randy Dunlap <rdunlap@infradead.org> 6 Author: Kefeng Wang <wangkefeng.wang@huawei.com> 6 Author: Anshuman Khandual <anshuman.khandual@arm.com> 5 Author: Masami Hiramatsu <mhiramat@kernel.org> 5 Author: Al Viro <viro@zeniv.linux.org.uk> 4 Author: Peter Zijlstra <peterz@infradead.org> 4 Author: Mike Rapoport <rppt@kernel.org> 4 Author: David Hildenbrand <david@redhat.com> 4 Author: Christophe Leroy <christophe.leroy@csgroup.eu> 3 Author: Yury Norov <yury.norov@gmail.com> 3 Author: Michal Hocko <mhocko@suse.com> 3 Author: Geert Uytterhoeven <geert+renesas@glider.be> 3 Author: Bhaskar Chowdhury <unixbhaskar@gmail.com> 3 Author: Baoquan He <bhe@redhat.com> 3 Author: Ard Biesheuvel <ardb@kernel.org> 3 Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> 2 Author: Yang Guang <yang.guang5@zte.com.cn> 2 Author: Will Deacon <will@kernel.org> 2 Author: Viresh Kumar <viresh.kumar@linaro.org> 2 Author: Valentin Schneider <vschneid@redhat.com> 2 Author: Stafford Horne <shorne@gmail.com> 2 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> 2 Author: Richard Guy Briggs <rgb@redhat.com> 2 Author: Peter Xu <peterx@redhat.com> 2 Author: Peter Collingbourne <pcc@google.com> 2 Author: Mark Rutland <mark.rutland@arm.com> 2 Author: Lukas Bulwahn <lukas.bulwahn@gmail.com> 2 Author: Julia Lawall <Julia.Lawall@inria.fr> 2 Author: Jens Axboe <axboe@kernel.dk> 2 Author: Jason Wang <wangborong@cdjrlc.com> 2 Author: Jan Kara <jack@suse.cz> 2 Author: Christoph Hellwig <hch@lst.de> 2 Author: Bjorn Helgaas <bhelgaas@google.com> 2 Author: Alexander Lobakin <alexandr.lobakin@intel.com> 1 Author: Zi Yan <ziy@nvidia.com> 1 Author: Zhang Yunkai <zhang.yunkai@zte.com.cn> 1 Author: ye xingchen <ye.xingchen@zte.com.cn> 1 Author: xu xin <xu.xin16@zte.com.cn> 1 Author: Wolfram Sang <wsa+renesas@sang-engineering.com> 1 Author: Weizhao Ouyang <o451686892@gmail.com> 1 Author: Suren Baghdasaryan <surenb@google.com> 1 Author: Souptick Joarder (HPE) <jrdr.linux@gmail.com> 1 Author: Sergey Shtylyov <s.shtylyov@omp.ru> 1 Author: Sergei Trofimovich <slyich@gmail.com> 1 Author: Sascha Hauer <s.hauer@pengutronix.de> 1 Author: Samuel Holland <samuel@sholland.org> 1 Author: Qi Zheng <zhengqi.arch@bytedance.com> 1 Author: Peng Liu <liupeng256@huawei.com> 1 Author: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> 1 Author: Muchun Song <muchun.song@linux.dev> 1 Author: Mikulas Patocka <mpatocka@redhat.com> 1 Author: Mike Kravetz <mike.kravetz@oracle.com> 1 Author: Mickaël Salaün <mic@linux.microsoft.com> 1 Author: Matthew Wilcox (Oracle) <willy@infradead.org> 1 Author: Martin Oliveira <martin.oliveira@eideticom.com> 1 Author: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> 1 Author: Kees Cook <keescook@chromium.org> 1 Author: Jason A. Donenfeld <Jason@zx2c4.com> 1 Author: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 1 Author: Haowen Bai <baihaowen@meizu.com> 1 Author: Gustavo A. R. Silva <gustavoars@kernel.org> 1 Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 1 Author: Gaosheng Cui <cuigaosheng1@huawei.com> 1 Author: Dmitry Osipenko <dmitry.osipenko@collabora.com> 1 Author: Dawei Li <set_pte_at@outlook.com> 1 Author: Chuck Lever <chuck.lever@oracle.com> 1 Author: Christian Brauner <brauner@kernel.org> 1 Author: Chris Down <chris@chrisdown.name> 1 Author: Chen Li <chenli@uniontech.com> 1 Author: Catalin Marinas <catalin.marinas@arm.com> 1 Author: Benjamin Stürz <benni@stuerz.xyz> 1 Author: Ben Dooks <ben-linux@fluff.org> 1 Author: Baolin Wang <baolin.wang@linux.alibaba.com> 1 Author: Andy Shevchenko <andriy.shevchenko@linux.intel.com> 1 Author: André Almeida <andrealmeid@igalia.com>
On Fri, Jan 13, 2023 at 08:55:41AM +0100, Ard Biesheuvel wrote: > On Fri, 13 Jan 2023 at 01:31, Luck, Tony <tony.luck@intel.com> wrote: > > > > > Yeah, if it was ia64-only, it's a non-issue these days. It's dead and > > > in pure maintenance mode from a kernel perspective (if even that). > > > > There's not much "simultaneous" in the SMT on ia64. One thread in a > > spin loop will hog the core until the h/w switches to the other thread some > > number of cycles (hundreds, thousands? I really can remember). So I > > was pretty generous with dropping cpu_relax() into any kind of spin loop. > > > > Is it time yet for: > > > > $ git rm -r arch/ia64 > > > > Hi Tony, > > Can I take that as an ack on [0]? The EFI subsystem has evolved > substantially over the years, and there is really no way to do any > IA64 testing beyond build testing, so from that perspective, dropping > it entirely would be welcomed. For what it's worth, Debian and Gentoo both have ia64 ports with active users (6.1 looks like it currently fails to build in Debian due to a minor packaging issue, but various versions of 6.0 were built and published, and one of those is running on the one ia64 Debian builder I personally have access to). Jess
> For what it's worth, Debian and Gentoo both have ia64 ports with active > users (6.1 looks like it currently fails to build in Debian due to a > minor packaging issue, but various versions of 6.0 were built and > published, and one of those is running on the one ia64 Debian builder I > personally have access to). Jess, So dropping ia64 from the upstream kernel won't just save time of kernel developers. It will also save time for the folks keeping Debian and Gentoo ports up and running. Are there people actually running production systems on ia64 that also update to v6.x kernels? If so, why? Just scrap the machine and replace with almost anything else. You'll cover the cost of the new machine in short order with the savings on your power bill. -Tony
On 13 Jan 2023, at 21:03, Luck, Tony <tony.luck@intel.com> wrote: > >> For what it's worth, Debian and Gentoo both have ia64 ports with active >> users (6.1 looks like it currently fails to build in Debian due to a >> minor packaging issue, but various versions of 6.0 were built and >> published, and one of those is running on the one ia64 Debian builder I >> personally have access to). > > Jess, > > So dropping ia64 from the upstream kernel won't just save time of kernel > developers. It will also save time for the folks keeping Debian and Gentoo > ports up and running. > > Are there people actually running production systems on ia64 that also > update to v6.x kernels? > > If so, why? Just scrap the machine and replace with almost anything else. > You'll cover the cost of the new machine in short order with the savings on > your power bill. Hobbyists, same as alpha, hppa, m68k, sh and any other such architectures that have no real use in this day and age. Jess
Hello Ard! > Can I take that as an ack on [0]? The EFI subsystem has evolved > substantially over the years, and there is really no way to do any > IA64 testing beyond build testing, so from that perspective, dropping > it entirely would be welcomed. ia64 is regularly tested in Debian and Gentoo [1][2]. Debian's ia64 porterbox yttrium runs a recent kernel without issues: root@yttrium:~# uname -a Linux yttrium 5.19.0-2-mckinley #1 SMP Debian 5.19.11-1 (2022-09-24) ia64 GNU/Linux root@yttrium:~# root@yttrium:~# journalctl -b|head -n10 Nov 14 14:46:10 yttrium kernel: Linux version 5.19.0-2-mckinley (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.3.0-6) 11.3.0, GNU ld (GNU Binutils for Debian) 2.39) #1 SMP Debian 5.19.11-1 (2022-09-24) Nov 14 14:46:10 yttrium kernel: efi: EFI v2.10 by HP Nov 14 14:46:10 yttrium kernel: efi: SALsystab=0xdfdd63a18 ESI=0xdfdd63f18 ACPI 2.0=0x3d3c4014 HCDP=0xdffff8798 SMBIOS=0x3d368000 Nov 14 14:46:10 yttrium kernel: PCDP: v3 at 0xdffff8798 Nov 14 14:46:10 yttrium kernel: earlycon: uart8250 at I/O port 0x4000 (options '115200n8') Nov 14 14:46:10 yttrium kernel: printk: bootconsole [uart8250] enabled Nov 14 14:46:10 yttrium kernel: ACPI: Early table checksum verification disabled Nov 14 14:46:10 yttrium kernel: ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP ) Nov 14 14:46:10 yttrium kernel: ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP RX2800-2 00000001 01000013) Nov 14 14:46:10 yttrium kernel: ACPI: FACP 0x000000003D3BE000 0000F4 (v03 HP RX2800-2 00000001 HP 00000001) root@yttrium:~# Same applies to the buildds: root@lifshitz:~# uname -a Linux lifshitz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux root@lifshitz:~# root@lenz:~# uname -a Linux lenz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux root@lenz:~# EFI works fine as well using the latest version of GRUB2. Thanks, Adrian > [1] https://cdimage.debian.org/cdimage/ports/snapshots/ > [2] https://mirror.yandex.ru/gentoo-distfiles//releases/ia64/autobuilds/
On Fri, 13 Jan 2023 at 22:06, John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> wrote: > > Hello Ard! > > > Can I take that as an ack on [0]? The EFI subsystem has evolved > > substantially over the years, and there is really no way to do any > > IA64 testing beyond build testing, so from that perspective, dropping > > it entirely would be welcomed. > > ia64 is regularly tested in Debian and Gentoo [1][2]. > > Debian's ia64 porterbox yttrium runs a recent kernel without issues: > > root@yttrium:~# uname -a > Linux yttrium 5.19.0-2-mckinley #1 SMP Debian 5.19.11-1 (2022-09-24) ia64 GNU/Linux > root@yttrium:~# > > root@yttrium:~# journalctl -b|head -n10 > Nov 14 14:46:10 yttrium kernel: Linux version 5.19.0-2-mckinley (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.3.0-6) 11.3.0, GNU ld (GNU Binutils for Debian) 2.39) #1 SMP Debian 5.19.11-1 (2022-09-24) > Nov 14 14:46:10 yttrium kernel: efi: EFI v2.10 by HP > Nov 14 14:46:10 yttrium kernel: efi: SALsystab=0xdfdd63a18 ESI=0xdfdd63f18 ACPI 2.0=0x3d3c4014 HCDP=0xdffff8798 SMBIOS=0x3d368000 > Nov 14 14:46:10 yttrium kernel: PCDP: v3 at 0xdffff8798 > Nov 14 14:46:10 yttrium kernel: earlycon: uart8250 at I/O port 0x4000 (options '115200n8') > Nov 14 14:46:10 yttrium kernel: printk: bootconsole [uart8250] enabled > Nov 14 14:46:10 yttrium kernel: ACPI: Early table checksum verification disabled > Nov 14 14:46:10 yttrium kernel: ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP ) > Nov 14 14:46:10 yttrium kernel: ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP RX2800-2 00000001 01000013) > Nov 14 14:46:10 yttrium kernel: ACPI: FACP 0x000000003D3BE000 0000F4 (v03 HP RX2800-2 00000001 HP 00000001) > root@yttrium:~# > > Same applies to the buildds: > > root@lifshitz:~# uname -a > Linux lifshitz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux > root@lifshitz:~# > > root@lenz:~# uname -a > Linux lenz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux > root@lenz:~# > > EFI works fine as well using the latest version of GRUB2. > > Thanks, > Adrian > > > [1] https://cdimage.debian.org/cdimage/ports/snapshots/ > > [2] https://mirror.yandex.ru/gentoo-distfiles//releases/ia64/autobuilds/ Thanks for reporting back. I (mis)read the debian ports page [3], which mentions Debian 7 as the highest Debian version that supports IA64, and so I assumed that support had been dropped from Debian. However, if only a handful of people want to keep this port alive for reasons of nostalgia, it is obviously obsolete, and we should ask ourselves whether it is reasonable to expect Linux contributors to keep spending time on this. Does the Debian ia64 port have any users? Or is the system that builds the packages the only one that consumes them? [3] https://www.debian.org/ports/ia64/
On Sat, Jan 14, 2023 at 12:43 AM Ard Biesheuvel <ardb@kernel.org> wrote: > > On Fri, 13 Jan 2023 at 22:06, John Paul Adrian Glaubitz > <glaubitz@physik.fu-berlin.de> wrote: > > > > Hello Ard! > > > > > Can I take that as an ack on [0]? The EFI subsystem has evolved > > > substantially over the years, and there is really no way to do any > > > IA64 testing beyond build testing, so from that perspective, dropping > > > it entirely would be welcomed. > > > > ia64 is regularly tested in Debian and Gentoo [1][2]. > > > > Debian's ia64 porterbox yttrium runs a recent kernel without issues: > > > > root@yttrium:~# uname -a > > Linux yttrium 5.19.0-2-mckinley #1 SMP Debian 5.19.11-1 (2022-09-24) ia64 GNU/Linux > > root@yttrium:~# > > > > root@yttrium:~# journalctl -b|head -n10 > > Nov 14 14:46:10 yttrium kernel: Linux version 5.19.0-2-mckinley (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.3.0-6) 11.3.0, GNU ld (GNU Binutils for Debian) 2.39) #1 SMP Debian 5.19.11-1 (2022-09-24) > > Nov 14 14:46:10 yttrium kernel: efi: EFI v2.10 by HP > > Nov 14 14:46:10 yttrium kernel: efi: SALsystab=0xdfdd63a18 ESI=0xdfdd63f18 ACPI 2.0=0x3d3c4014 HCDP=0xdffff8798 SMBIOS=0x3d368000 > > Nov 14 14:46:10 yttrium kernel: PCDP: v3 at 0xdffff8798 > > Nov 14 14:46:10 yttrium kernel: earlycon: uart8250 at I/O port 0x4000 (options '115200n8') > > Nov 14 14:46:10 yttrium kernel: printk: bootconsole [uart8250] enabled > > Nov 14 14:46:10 yttrium kernel: ACPI: Early table checksum verification disabled > > Nov 14 14:46:10 yttrium kernel: ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP ) > > Nov 14 14:46:10 yttrium kernel: ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP RX2800-2 00000001 01000013) > > Nov 14 14:46:10 yttrium kernel: ACPI: FACP 0x000000003D3BE000 0000F4 (v03 HP RX2800-2 00000001 HP 00000001) > > root@yttrium:~# > > > > Same applies to the buildds: > > > > root@lifshitz:~# uname -a > > Linux lifshitz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux > > root@lifshitz:~# > > > > root@lenz:~# uname -a > > Linux lenz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux > > root@lenz:~# > > > > EFI works fine as well using the latest version of GRUB2. > > > > Thanks, > > Adrian > > > > > [1] https://cdimage.debian.org/cdimage/ports/snapshots/ > > > [2] https://mirror.yandex.ru/gentoo-distfiles//releases/ia64/autobuilds/ > > Thanks for reporting back. I (mis)read the debian ports page [3], > which mentions Debian 7 as the highest Debian version that supports > IA64, and so I assumed that support had been dropped from Debian. > > However, if only a handful of people want to keep this port alive for > reasons of nostalgia, it is obviously obsolete, and we should ask > ourselves whether it is reasonable to expect Linux contributors to > keep spending time on this. > > Does the Debian ia64 port have any users? Or is the system that builds > the packages the only one that consumes them? > > > [3] https://www.debian.org/ports/ia64/ I have no IA64 hardware or be a user of it or have any strong feelings to keep this arch in the Linux-kernel. But I am a Debianist (Debian/unstable AMD64 user). Best is to ask the Debian release-team or (if there exist) maintainers or responsibles for the IA64 port - which is an ***unofficial*** port. What I found... on <cdimage.debian.org>: https://cdimage.debian.org/cdimage/ > Ports https://cdimage.debian.org/cdimage/ports/current/ https://cdimage.debian.org/cdimage/ports/current-debian-installer/ia64/debian-installer-images_20211020_ia64.tar.gz ^^ Last modified: 2021-10-20 22:52 https://cdimage.debian.org/cdimage/ports/current/debian-11.0.0-ia64-NETINST-1.iso ^^ Last modofied: 2022-03-28 14:18 With a net-install image you should be able to setup and explore the IA64 Debian cosmos. Example #1: binutils packages Checking available binutils package for Debian/unstable IA64 (version: 2.39.90.20230110-1): https://packages.debian.org/sid/binutils <--- Clearly states IA64 as "unofficial port" https://packages.debian.org/sid/ia64/binutils/filelist Example #2: linux-image packages Cannot say what this means... https://packages.debian.org/search?arch=amd64&keywords=linux-image (AMD64 - matches) https://packages.debian.org/search?arch=ia64&keywords=linux-image (IA64 - no matches) https://packages.debian.org/search?arch=ia64&keywords=linux (IA64 - matches - but no linux-image which ships normally a bootable Linux-kernel) As stated I have no expertise in Debian whatever release for IA64 arch. Hope that helps. -Sedat-
[ ... ] > Best is to ask the Debian release-team or (if there exist) maintainers > or responsibles for the IA64 port - which is an ***unofficial*** port. > Here we go: https://lists.debian.org/debian-ia64/ Posting address: debian-ia64@lists.debian.org Found via <https://lists.debian.org/completeindex.html> -Sedat-
On Sat, Jan 14, 2023 at 12:28:30PM +0100, Sedat Dilek wrote: > [ ... ] > > > Best is to ask the Debian release-team or (if there exist) maintainers > > or responsibles for the IA64 port - which is an ***unofficial*** port. > > > > Here we go: > > https://lists.debian.org/debian-ia64/ > > Posting address: debian-ia64@lists.debian.org > > Found via <https://lists.debian.org/completeindex.html> More useful perhaps is to look at https://popcon.debian.org/ There are three machines reporting popcon results. It's dead.
On Sun, Jan 15, 2023 at 1:27 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Sat, Jan 14, 2023 at 12:28:30PM +0100, Sedat Dilek wrote: > > [ ... ] > > > > > Best is to ask the Debian release-team or (if there exist) maintainers > > > or responsibles for the IA64 port - which is an ***unofficial*** port. > > > > > > > Here we go: > > > > https://lists.debian.org/debian-ia64/ > > > > Posting address: debian-ia64@lists.debian.org > > > > Found via <https://lists.debian.org/completeindex.html> > > More useful perhaps is to look at https://popcon.debian.org/ > > There are three machines reporting popcon results. It's dead. Exactly, Debian Popularity Contest was what I was looking for yesterday. Thanks Matthew. [1] says in Inst (204701): Name || Number || % ================================== binutils-x86-64-linux-gnu || 101548 || 49.61% binutils-ia64-linux-gnu || 11 || 0.01% HELP: Inst. is the number of people who installed this package (sum of the four categories below) There may be more popular packages than binutils. ( binutils might tell something about development happening or not. ) Anyway, I am not a popcon expert and never participated in Debian's Popularity Contest. -Sedat- [1] https://qa.debian.org/popcon.php?package=binutils
Hi Ard! On 1/14/23 00:25, Ard Biesheuvel wrote: > Thanks for reporting back. I (mis)read the debian ports page [3], > which mentions Debian 7 as the highest Debian version that supports > IA64, and so I assumed that support had been dropped from Debian. This page talks about officially supported ports. Debian Ports is an unofficial spin maintained by a number of Debian Developers and external developers that are volunteering to maintain these ports. > However, if only a handful of people want to keep this port alive for > reasons of nostalgia, it is obviously obsolete, and we should ask > ourselves whether it is reasonable to expect Linux contributors to > keep spending time on this. You could say this about a lot of hardware, can't you? > Does the Debian ia64 port have any users? Or is the system that builds > the packages the only one that consumes them? There is the popcon statistics. However, that is opt-on and the numbers are not really trustworthy. We are getting feedback from time to time from people using it. Is there any problem with the ia64 port at the moment that would justify removal? Adrian
On 1/14/23 12:24, Sedat Dilek wrote: > Example #1: binutils packages > > Checking available binutils package for Debian/unstable IA64 (version: > 2.39.90.20230110-1): > > https://packages.debian.org/sid/binutils <--- Clearly states IA64 as > "unofficial port" And? > https://packages.debian.org/sid/ia64/binutils/filelist > > Example #2: linux-image packages > > Cannot say what this means... > > https://packages.debian.org/search?arch=amd64&keywords=linux-image > (AMD64 - matches) > > https://packages.debian.org/search?arch=ia64&keywords=linux-image > (IA64 - no matches) > > https://packages.debian.org/search?arch=ia64&keywords=linux (IA64 - > matches - but no linux-image which ships normally a bootable > Linux-kernel) No, the package is called "linux-image-$ARCH-$FLAVOR". There is no "linux-image" package for amd64 either. Does that prove amd64 is dead? What you are looking for can be found here: > http://ftp.ports.debian.org/debian-ports/pool-ia64/main/l/linux/ > As stated I have no expertise in Debian whatever release for IA64 arch. Well, maybe let me answer the questions then since I am maintaining the port in Debian. Adrian
On 1/14/23 12:28, Sedat Dilek wrote: > [ ... ] > >> Best is to ask the Debian release-team or (if there exist) maintainers >> or responsibles for the IA64 port - which is an ***unofficial*** port. >> > > Here we go: > > https://lists.debian.org/debian-ia64/ > > Posting address: debian-ia64@lists.debian.org > > Found via <https://lists.debian.org/completeindex.html> And the wiki lists Jason Duerstock, Jessica Clarke and me as maintainers: > https://wiki.debian.org/Ports/ia64 Adrian
On 1/15/23 01:27, Matthew Wilcox wrote: > More useful perhaps is to look at https://popcon.debian.org/ > > There are three machines reporting popcon results. It's dead. It's an opt-in mechanism that reports 190,000 machines running Debian on x86_64. Do you think that there are only 190,000 servers world-wide running Debian? Adrian
On 1/15/23 13:04, Sedat Dilek wrote: > Exactly, Debian Popularity Contest was what I was looking for yesterday. > > Thanks Matthew. > > [1] says in Inst (204701): > > Name || Number || % > ================================== > binutils-x86-64-linux-gnu || 101548 || 49.61% > binutils-ia64-linux-gnu || 11 || 0.01% > > HELP: Inst. is the number of people who installed this package (sum of > the four categories below) > > There may be more popular packages than binutils. > ( binutils might tell something about development happening or not. ) > > Anyway, I am not a popcon expert and never participated in Debian's > Popularity Contest. And that's the point, it's opt-in! Adrian
On Mon, 16 Jan 2023 at 10:33, John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> wrote: > > Hi Ard! > > On 1/14/23 00:25, Ard Biesheuvel wrote: > > Thanks for reporting back. I (mis)read the debian ports page [3], > > which mentions Debian 7 as the highest Debian version that supports > > IA64, and so I assumed that support had been dropped from Debian. > > This page talks about officially supported ports. Debian Ports is an > unofficial spin maintained by a number of Debian Developers and external > developers that are volunteering to maintain these ports. > > > However, if only a handful of people want to keep this port alive for > > reasons of nostalgia, it is obviously obsolete, and we should ask > > ourselves whether it is reasonable to expect Linux contributors to > > keep spending time on this. > > You could say this about a lot of hardware, can't you? > Uhm, yes. Linux contributor effort is a scarce resource, and spending it on architectures that nobody actually uses, such as alpha or ia64, means it is not spent on things that are useful to more people. I really do sympathize with the enthusiast/hobbyist PoV - I am also an engineer that likes to tinker. So 'use' can be defined liberally here, and cover running the latest Linux on ancient hardware just for entertainment. However, the question is not how you or I choose to spend (or waste) their time. The question is whether it is reasonable *as a community* to insist that everyone who contributes a cross-architecture change also has to ensure that obsolete architectures such as i64 or alpha are not left behind. The original thread is an interesting example here - removing a cpu_relax() in cmpxchg() that was only there because of IA64's clunky SMT implementation. Perhaps this means that IA64 performance is going to regress substantially for some workloads? Should anyone care? Should we test such changes first? And how should we do that if there is no maintainer and nobody has access to the hardware? The other example is EFI, which i maintain. Should I require from contributors that they build and boot test EFI changes on ia64 if I myself don't even have access to the hardware? It is good to know that things don't seem to be broken today, but if it is going to fall over, it may take a while before anybody notices. What happens then? > > Does the Debian ia64 port have any users? Or is the system that builds > > the packages the only one that consumes them? > > There is the popcon statistics. However, that is opt-on and the numbers are > not really trustworthy. We are getting feedback from time to time from people > using it. > > Is there any problem with the ia64 port at the moment that would justify removal? > I would argue that we should mark it obsolete at the very least, so that it is crystal clear that regressing IA64 (either knowingly or unknowingly) by a generic or cross-architecture change is not a showstopper, even at build time. Then, if someone has the skill set and the time on their hands, as well as access to actual hardware, they can keep it alive if they want to.
On Mon, Jan 16, 2023 at 10:41:14AM +0100, John Paul Adrian Glaubitz wrote: > On 1/15/23 01:27, Matthew Wilcox wrote: > > More useful perhaps is to look at https://popcon.debian.org/ > > > > There are three machines reporting popcon results. It's dead. > > It's an opt-in mechanism that reports 190,000 machines running Debian > on x86_64. Do you think that there are only 190,000 servers world-wide > running Debian? No. I think there are so few ia64 machines still in operation that the effort required to continue to support them is now too high relative to the benefits. We've dropped support for hardware that still exists before, and we'll do it again. The only question is when. I still have two ia64 machines in my basement. I've turned one of them on once since 2009. And that was because I had a bug I needed to track down and fix (2018, commits 4b664e739f77 and 2879b65f9de8).
Hi Linus, I'm not sure how relevant it is to the topic, but I seem to remember you having a go at implementing spinlocks with x86_64 memory transaction instructions a while back. Do you have any thoughts on whether these instructions are ever likely to become something we can use? I was looking specifically at the skbuff queue stuff which does { spin_lock_irq, add to list, inc count, spin_unlock_irq } and thinking that might be a good place to use such a thing. David
On Mon, Jan 16, 2023 at 02:08:15PM +0000, David Howells wrote: > Hi Linus, > > I'm not sure how relevant it is to the topic, but I seem to remember you > having a go at implementing spinlocks with x86_64 memory transaction > instructions a while back. Do you have any thoughts on whether these > instructions are ever likely to become something we can use? Ever is a long time, but not while they're still buggy: https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions and not while they're not actually available on a vast majority of x86 hardware. ie AMD needs to implement them, make them available as standard, forcing Intel to enable them globally instead of restricting them to those who pay the $2.50/month subscription fee.
On Mon, Jan 16, 2023 at 6:08 AM David Howells <dhowells@redhat.com> wrote: > > I'm not sure how relevant it is to the topic, but I seem to remember you > having a go at implementing spinlocks with x86_64 memory transaction > instructions a while back. Do you have any thoughts on whether these > instructions are ever likely to become something we can use? Nope. Not only are they buggy, the actual cost of them was prohibitive too. The only implementation I ever did was what then became 'lockref' (so this was about a decade ago), and using transactional memory was actually quite noticeably *slower* than just using a spinlock in the common cases (particularly the uncontended case - which is actually the common one by far, despite some benchmarks). And while the argument could be that transactional memory has improved in the last decade (narrator: "It hasn't"), the problem is actually quite fundamental. The whole "best case scenario" for transactional memory concept is literally optimized and designed for the rare circumstance - the case where you have contention, but where the locking part is basically an idempotent operation (so "lock+unlock" ends up undoing each other and can use shared cachelines rather than bouncing the cacheline). And contention pretty much happens in one situation, and one situation only: in a combination of (a) bad locking and (b) benchmarks. And for the kernel, where we don't have bad locking, and where we actually use fine-grained locks that are _near_ the data that we are locking (the lockref of the dcache is obviously one example of that, but the skbuff queue you mention is almost certainly exactly the same situation): the lock is right by the data that the lock protects, and the "shared lock cacheline" model simply does not work. You'll bounce the data, and most likely you'll also touch the same lock cacheline too. I was quite excited about transactional memory, but have (obviously) come to the conclusion that it was one of those "looks good in theory, but is just garbage in reality". So this isn't an "the x86 implementation was bad" issue. This is a "transactional memory is a bad idea" situation. It complicates the most important part of the core CPU (the memory pipeline) enormously, and it doesn't actually really work. Now, with that "transactional memory is garbage" in mind, let me just mention that there are obviously other circumstances: (a) horrendously bad locking If you have a single centralized lock, and you really have serious contention on it in real life all the time (and not just under microbenchmarks), and the data that gets touched is spread out all over, and you simply cannot fix the locking, then all those theoretical advantages of transactional memory can actually come into play. But again: even then, there's an absolutely huge cost to the memory pipeline. So that advantage really doesn't come free. Nobody has ever gotten transactional memory to actually work well in real life. Not Intel, not IBM with powerpc HTM. That should tell you something. The hw complexity cost is very very real. But Intel actually had some loads where TSX helped. I think it was SAP HANA, and I really think that what you should take away from that is that SAP HANA locking is likely horrible garbage inside. But maybe I'm biased, and maybe SAP HANA is lovely, and maybe it just happened to work really well for TSX. I've never actually looked at that load, but if I were a betting man... (b) very *limited* transactions are probably a great idea LL/SC is arguably a form of "transactional memory", just with a single word. Now, I think atomics are generally usually better than LL/SC when it really is just a single word (because most of the time, "single word" also means "simple semantics", and while CMPXCHG loops aren't lovely when the semantics are a bit more complex, it's actually nice and portable and works at a higher level than assembler sequences and as such is actually a *lot* better than LL/SC in practice). But a "N-word LL/SC with forward progress guarantees" would be objectively stronger than atomics, and would allow for doing low-level data structures in ways that atomics don't. Doubly linked lists together with a lock check would argue for "N" being ~5 memory operations. So I do believe in potentially *limited* memory transactions, when they can be made limited enough that you have (1) forward progress guarantees with no "separate fallback on contention" garbage (2) can keep the hw implementation simple enough with just a store buffer and an (architecturally) very limited number of cacheline accesses that you can actually order in HW so that you don't have ABBA situations. But the generic kind of transactional memory that Intel tried with TSX (and IBM with HTM) is pure garbage, and almost certainly always will be. And all of the above is obviously just my opinionated input on it. But I just happen to be right. Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > And for the kernel, where we don't have bad locking, and where we > actually use fine-grained locks that are _near_ the data that we are > locking (the lockref of the dcache is obviously one example of that, > but the skbuff queue you mention is almost certainly exactly the same > situation): the lock is right by the data that the lock protects, and > the "shared lock cacheline" model simply does not work. You'll bounce > the data, and most likely you'll also touch the same lock cacheline > too. Yeah. The reason I was actually wondering about them was if it would be possible to avoid the requirement to disable interrupts/softirqs to, say, modify the skbuff queue. On some arches actually disabling irqs is quite a heavy operation (I think this is/was true on ppc64, for example; it certainly was on frv) and it was necessary to "emulate" the disablement. David
On Wed Jan 18, 2023 at 7:05 PM AEST, David Howells wrote: > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > And for the kernel, where we don't have bad locking, and where we > > actually use fine-grained locks that are _near_ the data that we are > > locking (the lockref of the dcache is obviously one example of that, > > but the skbuff queue you mention is almost certainly exactly the same > > situation): the lock is right by the data that the lock protects, and > > the "shared lock cacheline" model simply does not work. You'll bounce > > the data, and most likely you'll also touch the same lock cacheline > > too. > > Yeah. The reason I was actually wondering about them was if it would be > possible to avoid the requirement to disable interrupts/softirqs to, say, > modify the skbuff queue. On some arches actually disabling irqs is quite a > heavy operation (I think this is/was true on ppc64, for example; it certainly > was on frv) and it was necessary to "emulate" the disablement. Not too bad on modern ppc64. Changing MSR in general has to flush the pipe and even re-fetch, because it can alter memory translation among other things, so it was heavy. Everything we support has a lightweight MSR change that just modifies the interrupt enable bit and only needs minor serialisation (although we still have that software-irq-disable thing which avoids the heavy MSR problem on old CPUs). Thanks, Nick
diff --git a/lib/lockref.c b/lib/lockref.c index 45e93ece8ba0..e057e1630e7c 100644 --- a/lib/lockref.c +++ b/lib/lockref.c @@ -2,6 +2,10 @@ #include <linux/export.h> #include <linux/lockref.h> +#ifndef arch_cpu_relax_cmpxchg_loop +#define arch_cpu_relax_cmpxchg_loop() cpu_relax() +#endif + #if USE_CMPXCHG_LOCKREF /* @@ -23,7 +27,7 @@ } \ if (!--retry) \ break; \ - cpu_relax(); \