lockref scalability on x86-64 vs cpu_relax

Message ID	CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@vger.kernel.org> MIME-Version: 1.0 From: Mateusz Guzik <mjguzik@gmail.com> Date: Fri, 13 Jan 2023 00:36:55 +0100 Message-ID: <CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com> Subject: lockref scalability on x86-64 vs cpu_relax To: linux-fsdevel@vger.kernel.org Cc: viro@zeniv.linux.org.uk, torvalds@linux-foundation.org, Jan Glauber <jan.glauber@gmail.com>, tony.luck@intel.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk
Series	lockref scalability on x86-64 vs cpu_relax \| expand lockref scalability on x86-64 vs cpu_relax

Mateusz Guzik Jan. 12, 2023, 11:36 p.m. UTC

Hello,

[cc is probably woefully incomplete, just grabbed people from lockref
history; should this land on a x86 list instead of vfs?]

I intended to send a patch which fixes cred-related bottleneck in
access(), and while getting the expected win for calls with different
files, I got a *slowdown* when benchmarking against the same file and
according to perf top it was all lockref. I'm going to post it after
this issue is resolved, interested parties can take a peek here:
https://dpaste.com/8SVDF8HJH .

The problem is visible with open3 test ("Same file open/close") from
will-it-scale. I ran the _processes variant against stock + no-pause
kernel on Cascade Lake (2 sockets * 24 cores * 2 threads) running
6.2-rc3.

Results are:
proc    stock   no-pause
1       805603  814942
2       1054980 1054781
8       1544802 1822858
24      1191064 2199665
48      851582  1469860
96      609481  1427170

As you can see degradation already shows up at ~8 workers.

While trying to do my homework regarding history in the area I found
this thread:
https://lkml.iu.edu/hypermail/linux/kernel/1309.0/02330.html

It mentions a stat-based test, which I presume was multithreaded
stat on the same dentry. will-it-scale somehow does not have stat
benches, so I posted a PR to add some:
https://github.com/antonblanchard/will-it-scale/pull/35/files

With fstat2_processes (Same file fstat) I get:
proc    stock   no-pause
1       3013872 3047636
2       4284687 4400421
8       3257721 5530156
24      2239819 5466127
48      1701072 5256609
96      1269157 6649326

To my understanding on said architecture failed cmpxchg still grants you
exclusive access to the cacheline, making immediate retry preferable
when trying to inc/dec unless a certain value is found. By doing pause
instead one not only induces a delay, but also increases likelihood that
the line will have to be grabbed E again. Something to that extent was
even stated in thread and it definitely lines up with results above.

I see pause first shoed up first here:
commit d472d9d98b463dd7a04f2bcdeafe4261686ce6ab
Author: Tony Luck <tony.luck@intel.com>
Date:   Tue Sep 3 14:49:49 2013 -0700

    lockref: Relax in cmpxchg loop

... without numbers attached to it. Given the above linked thread it
looks like the arch this was targeting was itanium, not x86-64, but
the change landed for everyone.

Later it was further augmented with:
commit 893a7d32e8e04ca4d6c882336b26ed660ca0a48d
Author: Jan Glauber <jan.glauber@gmail.com>
Date:   Wed Jun 5 15:48:49 2019 +0200

    lockref: Limit number of cmpxchg loop retries
[snip]
    With the retry limit the performance of an open-close testcase
    improved between 60-70% on ThunderX2.

While the benchmark was specifically on ThunderX2, the change once more
was made for all archs.

I should note in my tests the retry limit was never reached fwiw.

That aside, the open-close testcase mentioned should match open3.

All that said, I think the thing to do here is to replace cpu_relax
with a dedicated arch-dependent macro, akin to the following:

+               arch_cpu_relax_cmpxchg_loop();
         \
        }
         \
 } while (0)

Then x86-64 would simply:
+#define        arch_cpu_relax_cmpxchg_loop do { } while (0)

Not an actual patch and I don't care about the name, just illustrating
what I mean.

I have to note there are probably numerous other cmpxchg loops without
the pause/fallback treatment, quick grep reveals one in
refcount_dec_not_one, if the fallback and/or cpu_relax thing is indeed
desirable the other loops should probably get augmented to have it.

Any comments?

If you agree with the idea I'll submit a  proper patch.

Linus Torvalds Jan. 13, 2023, 12:13 a.m. UTC | #1

[ Adding linux-arch, which is relevant but not very specific, and the
arm64 and powerpc maintainers that are the more specific cases for an
architecture where this might actually matter.

  See

        https://lore.kernel.org/all/CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com/

  for original full email, but it might be sufficiently clear just
from this heavily cut-down context too ]

Side note on your access() changes - if it turns out that you can
remove all the cred games, we should possibly then revert my old
commit d7852fbd0f04 ("access: avoid the RCU grace period for the
temporary subjective credentials") which avoided the biggest issue
with the unnecessary cred switching.

I *think* access() is the only user of that special 'non_rcu' thing,
but it is possible that the whole 'non_rcu' thing ends up mattering
for cases where the cred actually does change because euid != uid (ie
suid programs), so this would need a bit more effort to do performance
testing on.

On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> To my understanding on said architecture failed cmpxchg still grants you
> exclusive access to the cacheline, making immediate retry preferable
> when trying to inc/dec unless a certain value is found.

I actually suspect that is _always_ the case - this is not like a
contended spinlock where we want to pause because we're waiting for
the value to change and become unlocked, this cmpxchg loop is likely
always better off just retrying with the new value.

That said, the "likely always better off" is purely about performance.

So I have this suspicion that the reason Tony added the cpu_relax()
was simply not about performance, but about other issues, like
fairness in SMT situations.

That said, evern from a fairness perspective the cpu_relax() sounds a
bit odd and unlikely - we're literally yielding when we lost a race,
so it hurts the _loser_, not the winner, and thus might make fairness
worse too.

I dunno.  Tony may have some memory of what the issue was.

> ... without numbers attached to it. Given the above linked thread it
> looks like the arch this was targeting was itanium, not x86-64, but
> the change landed for everyone.

Yeah, if it was ia64-only, it's a non-issue these days. It's dead and
in pure maintenance mode from a kernel perspective (if even that).

> Later it was further augmented with:
> commit 893a7d32e8e04ca4d6c882336b26ed660ca0a48d
> Author: Jan Glauber <jan.glauber@gmail.com>
> Date:   Wed Jun 5 15:48:49 2019 +0200
>
>     lockref: Limit number of cmpxchg loop retries
> [snip]
>     With the retry limit the performance of an open-close testcase
>     improved between 60-70% on ThunderX2.
>
> While the benchmark was specifically on ThunderX2, the change once more
> was made for all archs.

Actually, in that case I did ask for the test to be run on x86
hardware too, and exactly like you found:

> I should note in my tests the retry limit was never reached fwiw.

the max loop retry number just isn't an issue. It fundamentally only
affects extremely unfair platforms, so it's arguably always the right
thing to do.

So it may be "ThunderX2 specific" in that that is where it was
noticed, but I think we can safely just consider the max loop thing to
be a generic safety net that hopefully simply never triggers in
practice on any sane platform.

> All that said, I think the thing to do here is to replace cpu_relax
> with a dedicated arch-dependent macro, akin to the following:

I would actually prefer just removing it entirely and see if somebody
else hollers. You have the numbers to prove it hurts on real hardware,
and I don't think we have any numbers to the contrary.

So I think it's better to trust the numbers and remove it as a
failure, than say "let's just remove it on x86-64 and leave everybody
else with the potentially broken code"

Because I do think that a cmpxchg loop that updates the value it
compares and exchanges is fundamentally different from a "busy-loop,
trying to read while locked", and with your numbers as ammunition, I
think it's better to just remove that cpu_relax() entirely.

Then other architectures can try to run their numbers, and only *if*
it then turns out that they have a reason to do something else should
we make this conditional and different on different architectures.

Let's try to keep the code as common as possibly until we have hard
evidence for special cases, in other words.

                 Linus

Tony Luck Jan. 13, 2023, 12:30 a.m. UTC | #2

> Yeah, if it was ia64-only, it's a non-issue these days. It's dead and
> in pure maintenance mode from a kernel perspective (if even that).

There's not much "simultaneous" in the SMT on ia64. One thread in a
spin loop will hog the core until the h/w switches to the other thread some
number of cycles (hundreds, thousands? I really can remember). So I
was pretty generous with dropping cpu_relax() into any kind of spin loop.

Is it time yet for:

$ git rm -r arch/ia64

-Tony

Linus Torvalds Jan. 13, 2023, 12:45 a.m. UTC | #3

On Thu, Jan 12, 2023 at 6:31 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> There's not much "simultaneous" in the SMT on ia64.

Oh, I forgot about the whole SoEMT fiasco.

Yeah, that might make ia64 act a bit differently here.

But I don't think anybody cares any more, so I don't think that merits
making this a per-architecture choice.

The s390 people hated cpu_relax() here, but for them it was really
because it was bad *everywhere*, and they just made it a no-op (see
commit 22b6430d3665 "locking/core, s390: Make cpu_relax() a barrier
again"). There had been a (failed) attempt at "cpu_relax_lowlatency()"
for the s390 issues.

                  Linus

Mateusz Guzik Jan. 13, 2023, 1:12 a.m. UTC | #4

On 1/13/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Side note on your access() changes - if it turns out that you can
> remove all the cred games, we should possibly then revert my old
> commit d7852fbd0f04 ("access: avoid the RCU grace period for the
> temporary subjective credentials") which avoided the biggest issue
> with the unnecessary cred switching.
>
> I *think* access() is the only user of that special 'non_rcu' thing,
> but it is possible that the whole 'non_rcu' thing ends up mattering
> for cases where the cred actually does change because euid != uid (ie
> suid programs), so this would need a bit more effort to do performance
> testing on.
>

I don't think the games are avoidable. For one I found non-root
processes with non-empty cap_effective even on my laptop, albeit I did
not check how often something like this is doing access().

Discussion for another time.

> On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>> All that said, I think the thing to do here is to replace cpu_relax
>> with a dedicated arch-dependent macro, akin to the following:
>
> I would actually prefer just removing it entirely and see if somebody
> else hollers. You have the numbers to prove it hurts on real hardware,
> and I don't think we have any numbers to the contrary.
>
> So I think it's better to trust the numbers and remove it as a
> failure, than say "let's just remove it on x86-64 and leave everybody
> else with the potentially broken code"
>
[snip]
> Then other architectures can try to run their numbers, and only *if*
> it then turns out that they have a reason to do something else should
> we make this conditional and different on different architectures.
>
> Let's try to keep the code as common as possibly until we have hard
> evidence for special cases, in other words.
>

I did not want to make such a change without redoing the ThunderX2
benchmark, or at least something else arm64-y. I may be able to bench it
tomorrow on whatever arm-y stuff can be found on Amazon's EC2, assuming
no arm64 people show up with their results.

Even then IMHO the safest route is to patch it out on x86-64 and give
other people time to bench their archs as they get around to it, and
ultimately whack the thing if it turns out nobody benefits from it.
I would say beats backpedaling on the removal, but I'm not going to
fight for it.

That said, does waiting for arm64 numbers and/or producing them for the
removal commit message sound like a plan? If so, I'll post soon(tm).

Nicholas Piggin Jan. 13, 2023, 3:20 a.m. UTC | #5

On Fri Jan 13, 2023 at 10:13 AM AEST, Linus Torvalds wrote:
> [ Adding linux-arch, which is relevant but not very specific, and the
> arm64 and powerpc maintainers that are the more specific cases for an
> architecture where this might actually matter.
>
>   See
>
>         https://lore.kernel.org/all/CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com/
>
>   for original full email, but it might be sufficiently clear just
> from this heavily cut-down context too ]
>
> Side note on your access() changes - if it turns out that you can
> remove all the cred games, we should possibly then revert my old
> commit d7852fbd0f04 ("access: avoid the RCU grace period for the
> temporary subjective credentials") which avoided the biggest issue
> with the unnecessary cred switching.
>
> I *think* access() is the only user of that special 'non_rcu' thing,
> but it is possible that the whole 'non_rcu' thing ends up mattering
> for cases where the cred actually does change because euid != uid (ie
> suid programs), so this would need a bit more effort to do performance
> testing on.
>
> On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > To my understanding on said architecture failed cmpxchg still grants you
> > exclusive access to the cacheline, making immediate retry preferable
> > when trying to inc/dec unless a certain value is found.
>
> I actually suspect that is _always_ the case - this is not like a
> contended spinlock where we want to pause because we're waiting for
> the value to change and become unlocked, this cmpxchg loop is likely
> always better off just retrying with the new value.

Yes this should be true for powerpc (POWER CPUs, at least).

> That said, the "likely always better off" is purely about performance.
>
> So I have this suspicion that the reason Tony added the cpu_relax()
> was simply not about performance, but about other issues, like
> fairness in SMT situations.
>
> That said, evern from a fairness perspective the cpu_relax() sounds a
> bit odd and unlikely - we're literally yielding when we lost a race,
> so it hurts the _loser_, not the winner, and thus might make fairness
> worse too.

Worse is that we've also actually just *won* a race when the cmpxchg
comes back, i.e., to get the line exclusive in our cache line. Then
we'll just sit there waiting and probably holding off other snoopers
for a while.

I don't see much of a fairness concern really. If there's a lot of
contention here we'll be stalled for a long time waiting on the line,
so SMT heuristics had better send resources to other threads making
better progress anyway as it should for any cache miss situation. So
this loop shouldn't be hogging up a lot of resources from the other
thread(s).

> I dunno.  Tony may have some memory of what the issue was.
>
> > ... without numbers attached to it. Given the above linked thread it
> > looks like the arch this was targeting was itanium, not x86-64, but
> > the change landed for everyone.
>
> Yeah, if it was ia64-only, it's a non-issue these days. It's dead and
> in pure maintenance mode from a kernel perspective (if even that).
>
> > Later it was further augmented with:
> > commit 893a7d32e8e04ca4d6c882336b26ed660ca0a48d
> > Author: Jan Glauber <jan.glauber@gmail.com>
> > Date:   Wed Jun 5 15:48:49 2019 +0200
> >
> >     lockref: Limit number of cmpxchg loop retries
> > [snip]
> >     With the retry limit the performance of an open-close testcase
> >     improved between 60-70% on ThunderX2.
> >
> > While the benchmark was specifically on ThunderX2, the change once more
> > was made for all archs.
>
> Actually, in that case I did ask for the test to be run on x86
> hardware too, and exactly like you found:
>
> > I should note in my tests the retry limit was never reached fwiw.
>
> the max loop retry number just isn't an issue. It fundamentally only
> affects extremely unfair platforms, so it's arguably always the right
> thing to do.
>
> So it may be "ThunderX2 specific" in that that is where it was
> noticed, but I think we can safely just consider the max loop thing to
> be a generic safety net that hopefully simply never triggers in
> practice on any sane platform.
>

If there are a lot of threads contending I'm sure x86, POWER, probably
most CPUs could quite possibly starve for hundreds if not thousands or
more of iterations here.

And I'm not really a fan of scattering random implementation specific
crutches ad hoc throughout our primitives. At least it could be specific
to the arch where it matters.

Interesting that improves performance so much though. I wonder why?
Hitting the limit will take the lock and that will cause all other CPUs
to drop out of the "fast" path so it will degenerate to a spinlock.
queued spinlock is pretty scalable but it really shouldn't be more
scalable than an atomic OP. I bet this cpu_relax isn't helping, and
probably ll/sc implementation of cmpxchg primitive doesn't help either.

I reckon if you removed the cpu_relax there, big x86 systems actually
wouldn't want that limit because the occasional case where you might hit
the limit will switch behaviour to locking and you might not recover
back to atomics if there is continued pressure on it.

> > All that said, I think the thing to do here is to replace cpu_relax
> > with a dedicated arch-dependent macro, akin to the following:
>
> I would actually prefer just removing it entirely and see if somebody
> else hollers. You have the numbers to prove it hurts on real hardware,
> and I don't think we have any numbers to the contrary.
>
> So I think it's better to trust the numbers and remove it as a
> failure, than say "let's just remove it on x86-64 and leave everybody
> else with the potentially broken code"

Agree. powerpc almost certainly doesn't want it, we would indeed hit
the issue identified which is that we'd get the line exclusive then go
to sleep for a while holding off other snoopers for a bit then probably
losing the line.

It would be nice if ThunderX2 case could be re-tested without that retry
count and without the cpu_relax too.

> Because I do think that a cmpxchg loop that updates the value it
> compares and exchanges is fundamentally different from a "busy-loop,
> trying to read while locked",

It is. powerpc (and probably other ll/sc archs) usually doesn't like this
style of cmpxchg loop.

Firstly, an initial load could bring the line in shared at first, then
it goes exclusive with the larx so 2x the transactions, but actually
worse because if it's shared it can let other CPUs go shared as well,
then the exclusive has to shoot those down, etc.

And if the value changes between those two loads, we fail it and try
again whereas we could have done the update inside the reserve. The
core isn't too happy about seeing a larx when there is already a larx
outstanding because it's worried that it has gone out of order and so it
doesn't want to let the second larx stamp on the first one potentially
before the stcx for the first one has executed. So it can hold the larx
until its the oldest instruction waiting for a stcx that never comes.
It's possible our cmpxchg primitives would actually do better with a
dummy stcx in the failure case to resolve that, but it's not necessarily
better. We don't have a light-weight cancel-reservation instruction
which is what we'd really want. The joys of ll/sc.

Actually what we'd really want is an arch specific implementation of
lockref.

Thanks,
Nick

Linus Torvalds Jan. 13, 2023, 4:08 a.m. UTC | #6

On Thu, Jan 12, 2023 at 7:12 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I did not want to make such a change without redoing the ThunderX2
> benchmark, or at least something else arm64-y. I may be able to bench it
> tomorrow on whatever arm-y stuff can be found on Amazon's EC2, assuming
> no arm64 people show up with their results.

I don't think ThunderX2 itself is particularly interesting, but sure,
it would be good to have numbers for some modern arm64 cores.

The newer Amazon EC2 cores (Graviton 2/3) sound more relevant (or
Ampere?)  The more different architecture numbers we'd have for that
"remove cpu_relax()", the better.

                   Linus

Linus Torvalds Jan. 13, 2023, 4:15 a.m. UTC | #7

On Thu, Jan 12, 2023 at 9:20 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Actually what we'd really want is an arch specific implementation of
> lockref.

The problem is mainly that then you need to generate the asm versions
of all those different CMPXCHG_LOOP()  variants.

They are all fairly simple, though, and it woudln't be hard to make
the current lib/lockref.c just be the generic fallback if you don't
have an arch-specific one.

And even if you do have the arch-specific LL/SC version, you'd still
want the generic fallback for the case where a spinlock isn't a single
word any more (which happens when the spinlock debugging options are
on).

            Linus

Nicholas Piggin Jan. 13, 2023, 5:36 a.m. UTC | #8

On Fri Jan 13, 2023 at 2:15 PM AEST, Linus Torvalds wrote:
> On Thu, Jan 12, 2023 at 9:20 PM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > Actually what we'd really want is an arch specific implementation of
> > lockref.
>
> The problem is mainly that then you need to generate the asm versions
> of all those different CMPXCHG_LOOP()  variants.
>
> They are all fairly simple, though, and it woudln't be hard to make
> the current lib/lockref.c just be the generic fallback if you don't
> have an arch-specific one.

Yeah, it doesn't look too onerous so it's probably worth seeing what
the code and some numbers look like here.

> And even if you do have the arch-specific LL/SC version, you'd still
> want the generic fallback for the case where a spinlock isn't a single
> word any more (which happens when the spinlock debugging options are
> on).

You're right, good point.

Thanks,
Nick

Ard Biesheuvel Jan. 13, 2023, 7:55 a.m. UTC | #9

On Fri, 13 Jan 2023 at 01:31, Luck, Tony <tony.luck@intel.com> wrote:
>
> > Yeah, if it was ia64-only, it's a non-issue these days. It's dead and
> > in pure maintenance mode from a kernel perspective (if even that).
>
> There's not much "simultaneous" in the SMT on ia64. One thread in a
> spin loop will hog the core until the h/w switches to the other thread some
> number of cycles (hundreds, thousands? I really can remember). So I
> was pretty generous with dropping cpu_relax() into any kind of spin loop.
>
> Is it time yet for:
>
> $ git rm -r arch/ia64
>

Hi Tony,

Can I take that as an ack on [0]? The EFI subsystem has evolved
substantially over the years, and there is really no way to do any
IA64 testing beyond build testing, so from that perspective, dropping
it entirely would be welcomed.

Thanks,
Ard.



[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=remove-ia64

Will Deacon Jan. 13, 2023, 9:46 a.m. UTC | #10

On Fri, Jan 13, 2023 at 02:12:50AM +0100, Mateusz Guzik wrote:
> On 1/13/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Side note on your access() changes - if it turns out that you can
> > remove all the cred games, we should possibly then revert my old
> > commit d7852fbd0f04 ("access: avoid the RCU grace period for the
> > temporary subjective credentials") which avoided the biggest issue
> > with the unnecessary cred switching.
> >
> > I *think* access() is the only user of that special 'non_rcu' thing,
> > but it is possible that the whole 'non_rcu' thing ends up mattering
> > for cases where the cred actually does change because euid != uid (ie
> > suid programs), so this would need a bit more effort to do performance
> > testing on.
> >
> 
> I don't think the games are avoidable. For one I found non-root
> processes with non-empty cap_effective even on my laptop, albeit I did
> not check how often something like this is doing access().
> 
> Discussion for another time.
> 
> > On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> All that said, I think the thing to do here is to replace cpu_relax
> >> with a dedicated arch-dependent macro, akin to the following:
> >
> > I would actually prefer just removing it entirely and see if somebody
> > else hollers. You have the numbers to prove it hurts on real hardware,
> > and I don't think we have any numbers to the contrary.
> >
> > So I think it's better to trust the numbers and remove it as a
> > failure, than say "let's just remove it on x86-64 and leave everybody
> > else with the potentially broken code"
> >
> [snip]
> > Then other architectures can try to run their numbers, and only *if*
> > it then turns out that they have a reason to do something else should
> > we make this conditional and different on different architectures.
> >
> > Let's try to keep the code as common as possibly until we have hard
> > evidence for special cases, in other words.
> >
> 
> I did not want to make such a change without redoing the ThunderX2
> benchmark, or at least something else arm64-y. I may be able to bench it
> tomorrow on whatever arm-y stuff can be found on Amazon's EC2, assuming
> no arm64 people show up with their results.
> 
> Even then IMHO the safest route is to patch it out on x86-64 and give
> other people time to bench their archs as they get around to it, and
> ultimately whack the thing if it turns out nobody benefits from it.
> I would say beats backpedaling on the removal, but I'm not going to
> fight for it.
> 
> That said, does waiting for arm64 numbers and/or producing them for the
> removal commit message sound like a plan? If so, I'll post soon(tm).

Honestly, I wouldn't worry about us (arm64) here. I don't think any real
hardware implements the YIELD instruction (i.e. it behaves as a NOP in
practice). The only place I'm aware of where it _does_ something is in
QEMU, which was actually the motivation behind having it in cpu_relax() to
start with (see 1baa82f48030 ("arm64: Implement cpu_relax as yield")).

So, from the arm64 side of the fence, I'm perfectly happy just removing
the cpu_relax() calls from lockref.

Will

Peter Zijlstra Jan. 13, 2023, 10:23 a.m. UTC | #11

On Thu, Jan 12, 2023 at 06:13:16PM -0600, Linus Torvalds wrote:
> On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > To my understanding on said architecture failed cmpxchg still grants you
> > exclusive access to the cacheline, making immediate retry preferable
> > when trying to inc/dec unless a certain value is found.
> 
> I actually suspect that is _always_ the case - this is not like a
> contended spinlock where we want to pause because we're waiting for
> the value to change and become unlocked, this cmpxchg loop is likely
> always better off just retrying with the new value.
> 
> That said, the "likely always better off" is purely about performance.
> 
> So I have this suspicion that the reason Tony added the cpu_relax()
> was simply not about performance, but about other issues, like
> fairness in SMT situations.
> 
> That said, evern from a fairness perspective the cpu_relax() sounds a
> bit odd and unlikely - we're literally yielding when we lost a race,
> so it hurts the _loser_, not the winner, and thus might make fairness
> worse too.

I've been writing cmpxchg loops that have strict termination conditions
without cpu_relax() in them for a while now.

For example, the x86 atomic_fetch_and() implementation looks like so:

static __always_inline int arch_atomic_fetch_and(int i, atomic_t *v)
{
	int val = arch_atomic_read(v);

	do { } while (!arch_atomic_try_cmpxchg(v, &val, val & i));

	return val;
}

And I did that because of the exact same argument you had above, it
needs to do the op anyway, waiting between failed attempts will only
increase the chance it will fail again.

Tony Luck Jan. 13, 2023, 4:17 p.m. UTC | #12

>> Is it time yet for:
>>
>> $ git rm -r arch/ia64
>>

> Can I take that as an ack on [0]? The EFI subsystem has evolved
> substantially over the years, and there is really no way to do any
> IA64 testing beyond build testing, so from that perspective, dropping
> it entirely would be welcomed.
>
> Thanks,
> Ard.
>
>
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=remove-ia64

Yes. EFI isn't the only issue. A bunch of folks[1] have spent time fixing ia64
for (in most cases) some tree-wide patches that they needed. Their time might
have been more productively spent fixing things that actually matter in modern times.

Acked-by: Tony Luck <tony.luck@intel.com>

-Tony

[1] git log --no-merges --since=2year -- arch/ia64 | grep Author: | sort | uniq -c | sort -rn
     19 Author: Masahiro Yamada <masahiroy@kernel.org>
     11 Author: Sergei Trofimovich <slyfox@gentoo.org>
      9 Author: Eric W. Biederman <ebiederm@xmission.com>
      8 Author: Arnd Bergmann <arnd@arndb.de>
      6 Author: Randy Dunlap <rdunlap@infradead.org>
      6 Author: Kefeng Wang <wangkefeng.wang@huawei.com>
      6 Author: Anshuman Khandual <anshuman.khandual@arm.com>
      5 Author: Masami Hiramatsu <mhiramat@kernel.org>
      5 Author: Al Viro <viro@zeniv.linux.org.uk>
      4 Author: Peter Zijlstra <peterz@infradead.org>
      4 Author: Mike Rapoport <rppt@kernel.org>
      4 Author: David Hildenbrand <david@redhat.com>
      4 Author: Christophe Leroy <christophe.leroy@csgroup.eu>
      3 Author: Yury Norov <yury.norov@gmail.com>
      3 Author: Michal Hocko <mhocko@suse.com>
      3 Author: Geert Uytterhoeven <geert+renesas@glider.be>
      3 Author: Bhaskar Chowdhury <unixbhaskar@gmail.com>
      3 Author: Baoquan He <bhe@redhat.com>
      3 Author: Ard Biesheuvel <ardb@kernel.org>
      3 Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      2 Author: Yang Guang <yang.guang5@zte.com.cn>
      2 Author: Will Deacon <will@kernel.org>
      2 Author: Viresh Kumar <viresh.kumar@linaro.org>
      2 Author: Valentin Schneider <vschneid@redhat.com>
      2 Author: Stafford Horne <shorne@gmail.com>
      2 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      2 Author: Richard Guy Briggs <rgb@redhat.com>
      2 Author: Peter Xu <peterx@redhat.com>
      2 Author: Peter Collingbourne <pcc@google.com>
      2 Author: Mark Rutland <mark.rutland@arm.com>
      2 Author: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      2 Author: Julia Lawall <Julia.Lawall@inria.fr>
      2 Author: Jens Axboe <axboe@kernel.dk>
      2 Author: Jason Wang <wangborong@cdjrlc.com>
      2 Author: Jan Kara <jack@suse.cz>
      2 Author: Christoph Hellwig <hch@lst.de>
      2 Author: Bjorn Helgaas <bhelgaas@google.com>
      2 Author: Alexander Lobakin <alexandr.lobakin@intel.com>
      1 Author: Zi Yan <ziy@nvidia.com>
      1 Author: Zhang Yunkai <zhang.yunkai@zte.com.cn>
      1 Author: ye xingchen <ye.xingchen@zte.com.cn>
      1 Author: xu xin <xu.xin16@zte.com.cn>
      1 Author: Wolfram Sang <wsa+renesas@sang-engineering.com>
      1 Author: Weizhao Ouyang <o451686892@gmail.com>
      1 Author: Suren Baghdasaryan <surenb@google.com>
      1 Author: Souptick Joarder (HPE) <jrdr.linux@gmail.com>
      1 Author: Sergey Shtylyov <s.shtylyov@omp.ru>
      1 Author: Sergei Trofimovich <slyich@gmail.com>
      1 Author: Sascha Hauer <s.hauer@pengutronix.de>
      1 Author: Samuel Holland <samuel@sholland.org>
      1 Author: Qi Zheng <zhengqi.arch@bytedance.com>
      1 Author: Peng Liu <liupeng256@huawei.com>
      1 Author: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      1 Author: Muchun Song <muchun.song@linux.dev>
      1 Author: Mikulas Patocka <mpatocka@redhat.com>
      1 Author: Mike Kravetz <mike.kravetz@oracle.com>
      1 Author: Mickaël Salaün <mic@linux.microsoft.com>
      1 Author: Matthew Wilcox (Oracle) <willy@infradead.org>
      1 Author: Martin Oliveira <martin.oliveira@eideticom.com>
      1 Author: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
      1 Author: Kees Cook <keescook@chromium.org>
      1 Author: Jason A. Donenfeld <Jason@zx2c4.com>
      1 Author: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
      1 Author: Haowen Bai <baihaowen@meizu.com>
      1 Author: Gustavo A. R. Silva <gustavoars@kernel.org>
      1 Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1 Author: Gaosheng Cui <cuigaosheng1@huawei.com>
      1 Author: Dmitry Osipenko <dmitry.osipenko@collabora.com>
      1 Author: Dawei Li <set_pte_at@outlook.com>
      1 Author: Chuck Lever <chuck.lever@oracle.com>
      1 Author: Christian Brauner <brauner@kernel.org>
      1 Author: Chris Down <chris@chrisdown.name>
      1 Author: Chen Li <chenli@uniontech.com>
      1 Author: Catalin Marinas <catalin.marinas@arm.com>
      1 Author: Benjamin Stürz <benni@stuerz.xyz>
      1 Author: Ben Dooks <ben-linux@fluff.org>
      1 Author: Baolin Wang <baolin.wang@linux.alibaba.com>
      1 Author: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      1 Author: André Almeida <andrealmeid@igalia.com>

Jessica Clarke Jan. 13, 2023, 8:49 p.m. UTC | #13

On Fri, Jan 13, 2023 at 08:55:41AM +0100, Ard Biesheuvel wrote:
> On Fri, 13 Jan 2023 at 01:31, Luck, Tony <tony.luck@intel.com> wrote:
> >
> > > Yeah, if it was ia64-only, it's a non-issue these days. It's dead and
> > > in pure maintenance mode from a kernel perspective (if even that).
> >
> > There's not much "simultaneous" in the SMT on ia64. One thread in a
> > spin loop will hog the core until the h/w switches to the other thread some
> > number of cycles (hundreds, thousands? I really can remember). So I
> > was pretty generous with dropping cpu_relax() into any kind of spin loop.
> >
> > Is it time yet for:
> >
> > $ git rm -r arch/ia64
> >
> 
> Hi Tony,
> 
> Can I take that as an ack on [0]? The EFI subsystem has evolved
> substantially over the years, and there is really no way to do any
> IA64 testing beyond build testing, so from that perspective, dropping
> it entirely would be welcomed.

For what it's worth, Debian and Gentoo both have ia64 ports with active
users (6.1 looks like it currently fails to build in Debian due to a
minor packaging issue, but various versions of 6.0 were built and
published, and one of those is running on the one ia64 Debian builder I
personally have access to).

Jess

Tony Luck Jan. 13, 2023, 9:03 p.m. UTC | #14

> For what it's worth, Debian and Gentoo both have ia64 ports with active
> users (6.1 looks like it currently fails to build in Debian due to a
> minor packaging issue, but various versions of 6.0 were built and
> published, and one of those is running on the one ia64 Debian builder I
> personally have access to).

Jess,

So dropping ia64 from the upstream kernel won't just save time of kernel
developers. It will also save time for the folks keeping Debian and Gentoo
ports up and running.

Are there people actually running production systems on ia64 that also
update to v6.x kernels?

If so, why? Just scrap the machine and replace with almost anything else.
You'll cover the cost of the new machine in short order with the savings on
your power bill.

-Tony

Jessica Clarke Jan. 13, 2023, 9:04 p.m. UTC | #15

On 13 Jan 2023, at 21:03, Luck, Tony <tony.luck@intel.com> wrote:
> 
>> For what it's worth, Debian and Gentoo both have ia64 ports with active
>> users (6.1 looks like it currently fails to build in Debian due to a
>> minor packaging issue, but various versions of 6.0 were built and
>> published, and one of those is running on the one ia64 Debian builder I
>> personally have access to).
> 
> Jess,
> 
> So dropping ia64 from the upstream kernel won't just save time of kernel
> developers. It will also save time for the folks keeping Debian and Gentoo
> ports up and running.
> 
> Are there people actually running production systems on ia64 that also
> update to v6.x kernels?
> 
> If so, why? Just scrap the machine and replace with almost anything else.
> You'll cover the cost of the new machine in short order with the savings on
> your power bill.

Hobbyists, same as alpha, hppa, m68k, sh and any other such
architectures that have no real use in this day and age.

Jess

John Paul Adrian Glaubitz Jan. 13, 2023, 9:05 p.m. UTC | #16

Hello Ard!

> Can I take that as an ack on [0]? The EFI subsystem has evolved
> substantially over the years, and there is really no way to do any
> IA64 testing beyond build testing, so from that perspective, dropping
> it entirely would be welcomed.

ia64 is regularly tested in Debian and Gentoo [1][2].

Debian's ia64 porterbox yttrium runs a recent kernel without issues:

root@yttrium:~# uname -a
Linux yttrium 5.19.0-2-mckinley #1 SMP Debian 5.19.11-1 (2022-09-24) ia64 GNU/Linux
root@yttrium:~#

root@yttrium:~# journalctl -b|head -n10
Nov 14 14:46:10 yttrium kernel: Linux version 5.19.0-2-mckinley (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.3.0-6) 11.3.0, GNU ld (GNU Binutils for Debian) 2.39) #1 SMP Debian 5.19.11-1 (2022-09-24)
Nov 14 14:46:10 yttrium kernel: efi: EFI v2.10 by HP
Nov 14 14:46:10 yttrium kernel: efi: SALsystab=0xdfdd63a18 ESI=0xdfdd63f18 ACPI 2.0=0x3d3c4014 HCDP=0xdffff8798 SMBIOS=0x3d368000
Nov 14 14:46:10 yttrium kernel: PCDP: v3 at 0xdffff8798
Nov 14 14:46:10 yttrium kernel: earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
Nov 14 14:46:10 yttrium kernel: printk: bootconsole [uart8250] enabled
Nov 14 14:46:10 yttrium kernel: ACPI: Early table checksum verification disabled
Nov 14 14:46:10 yttrium kernel: ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
Nov 14 14:46:10 yttrium kernel: ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2 00000001      01000013)
Nov 14 14:46:10 yttrium kernel: ACPI: FACP 0x000000003D3BE000 0000F4 (v03 HP     RX2800-2 00000001 HP   00000001)
root@yttrium:~#

Same applies to the buildds:

root@lifshitz:~# uname -a
Linux lifshitz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux
root@lifshitz:~#

root@lenz:~# uname -a
Linux lenz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux
root@lenz:~#

EFI works fine as well using the latest version of GRUB2.

Thanks,
Adrian

> [1] https://cdimage.debian.org/cdimage/ports/snapshots/
> [2] https://mirror.yandex.ru/gentoo-distfiles//releases/ia64/autobuilds/

Ard Biesheuvel Jan. 13, 2023, 11:25 p.m. UTC | #17

On Fri, 13 Jan 2023 at 22:06, John Paul Adrian Glaubitz
<glaubitz@physik.fu-berlin.de> wrote:
>
> Hello Ard!
>
> > Can I take that as an ack on [0]? The EFI subsystem has evolved
> > substantially over the years, and there is really no way to do any
> > IA64 testing beyond build testing, so from that perspective, dropping
> > it entirely would be welcomed.
>
> ia64 is regularly tested in Debian and Gentoo [1][2].
>
> Debian's ia64 porterbox yttrium runs a recent kernel without issues:
>
> root@yttrium:~# uname -a
> Linux yttrium 5.19.0-2-mckinley #1 SMP Debian 5.19.11-1 (2022-09-24) ia64 GNU/Linux
> root@yttrium:~#
>
> root@yttrium:~# journalctl -b|head -n10
> Nov 14 14:46:10 yttrium kernel: Linux version 5.19.0-2-mckinley (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.3.0-6) 11.3.0, GNU ld (GNU Binutils for Debian) 2.39) #1 SMP Debian 5.19.11-1 (2022-09-24)
> Nov 14 14:46:10 yttrium kernel: efi: EFI v2.10 by HP
> Nov 14 14:46:10 yttrium kernel: efi: SALsystab=0xdfdd63a18 ESI=0xdfdd63f18 ACPI 2.0=0x3d3c4014 HCDP=0xdffff8798 SMBIOS=0x3d368000
> Nov 14 14:46:10 yttrium kernel: PCDP: v3 at 0xdffff8798
> Nov 14 14:46:10 yttrium kernel: earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
> Nov 14 14:46:10 yttrium kernel: printk: bootconsole [uart8250] enabled
> Nov 14 14:46:10 yttrium kernel: ACPI: Early table checksum verification disabled
> Nov 14 14:46:10 yttrium kernel: ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
> Nov 14 14:46:10 yttrium kernel: ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2 00000001      01000013)
> Nov 14 14:46:10 yttrium kernel: ACPI: FACP 0x000000003D3BE000 0000F4 (v03 HP     RX2800-2 00000001 HP   00000001)
> root@yttrium:~#
>
> Same applies to the buildds:
>
> root@lifshitz:~# uname -a
> Linux lifshitz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux
> root@lifshitz:~#
>
> root@lenz:~# uname -a
> Linux lenz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux
> root@lenz:~#
>
> EFI works fine as well using the latest version of GRUB2.
>
> Thanks,
> Adrian
>
> > [1] https://cdimage.debian.org/cdimage/ports/snapshots/
> > [2] https://mirror.yandex.ru/gentoo-distfiles//releases/ia64/autobuilds/

Thanks for reporting back. I (mis)read the debian ports page [3],
which mentions Debian 7 as the highest Debian version that supports
IA64, and so I assumed that support had been dropped from Debian.

However, if only a handful of people want to keep this port alive for
reasons of nostalgia, it is obviously obsolete, and we should ask
ourselves whether it is reasonable to expect Linux contributors to
keep spending time on this.

Does the Debian ia64 port have any users? Or is the system that builds
the packages the only one that consumes them?


[3] https://www.debian.org/ports/ia64/

Sedat Dilek Jan. 14, 2023, 11:24 a.m. UTC | #18

On Sat, Jan 14, 2023 at 12:43 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Fri, 13 Jan 2023 at 22:06, John Paul Adrian Glaubitz
> <glaubitz@physik.fu-berlin.de> wrote:
> >
> > Hello Ard!
> >
> > > Can I take that as an ack on [0]? The EFI subsystem has evolved
> > > substantially over the years, and there is really no way to do any
> > > IA64 testing beyond build testing, so from that perspective, dropping
> > > it entirely would be welcomed.
> >
> > ia64 is regularly tested in Debian and Gentoo [1][2].
> >
> > Debian's ia64 porterbox yttrium runs a recent kernel without issues:
> >
> > root@yttrium:~# uname -a
> > Linux yttrium 5.19.0-2-mckinley #1 SMP Debian 5.19.11-1 (2022-09-24) ia64 GNU/Linux
> > root@yttrium:~#
> >
> > root@yttrium:~# journalctl -b|head -n10
> > Nov 14 14:46:10 yttrium kernel: Linux version 5.19.0-2-mckinley (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.3.0-6) 11.3.0, GNU ld (GNU Binutils for Debian) 2.39) #1 SMP Debian 5.19.11-1 (2022-09-24)
> > Nov 14 14:46:10 yttrium kernel: efi: EFI v2.10 by HP
> > Nov 14 14:46:10 yttrium kernel: efi: SALsystab=0xdfdd63a18 ESI=0xdfdd63f18 ACPI 2.0=0x3d3c4014 HCDP=0xdffff8798 SMBIOS=0x3d368000
> > Nov 14 14:46:10 yttrium kernel: PCDP: v3 at 0xdffff8798
> > Nov 14 14:46:10 yttrium kernel: earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
> > Nov 14 14:46:10 yttrium kernel: printk: bootconsole [uart8250] enabled
> > Nov 14 14:46:10 yttrium kernel: ACPI: Early table checksum verification disabled
> > Nov 14 14:46:10 yttrium kernel: ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
> > Nov 14 14:46:10 yttrium kernel: ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2 00000001      01000013)
> > Nov 14 14:46:10 yttrium kernel: ACPI: FACP 0x000000003D3BE000 0000F4 (v03 HP     RX2800-2 00000001 HP   00000001)
> > root@yttrium:~#
> >
> > Same applies to the buildds:
> >
> > root@lifshitz:~# uname -a
> > Linux lifshitz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux
> > root@lifshitz:~#
> >
> > root@lenz:~# uname -a
> > Linux lenz 6.0.0-4-mckinley #1 SMP Debian 6.0.8-1 (2022-11-11) ia64 GNU/Linux
> > root@lenz:~#
> >
> > EFI works fine as well using the latest version of GRUB2.
> >
> > Thanks,
> > Adrian
> >
> > > [1] https://cdimage.debian.org/cdimage/ports/snapshots/
> > > [2] https://mirror.yandex.ru/gentoo-distfiles//releases/ia64/autobuilds/
>
> Thanks for reporting back. I (mis)read the debian ports page [3],
> which mentions Debian 7 as the highest Debian version that supports
> IA64, and so I assumed that support had been dropped from Debian.
>
> However, if only a handful of people want to keep this port alive for
> reasons of nostalgia, it is obviously obsolete, and we should ask
> ourselves whether it is reasonable to expect Linux contributors to
> keep spending time on this.
>
> Does the Debian ia64 port have any users? Or is the system that builds
> the packages the only one that consumes them?
>
>
> [3] https://www.debian.org/ports/ia64/

I have no IA64 hardware or be a user of it or have any strong feelings
to keep this arch in the Linux-kernel.

But I am a Debianist (Debian/unstable AMD64 user).

Best is to ask the Debian release-team or (if there exist) maintainers
or responsibles for the IA64 port - which is an ***unofficial*** port.

What I found... on <cdimage.debian.org>:

https://cdimage.debian.org/cdimage/ > Ports

https://cdimage.debian.org/cdimage/ports/current/

https://cdimage.debian.org/cdimage/ports/current-debian-installer/ia64/debian-installer-images_20211020_ia64.tar.gz
^^ Last modified: 2021-10-20 22:52

https://cdimage.debian.org/cdimage/ports/current/debian-11.0.0-ia64-NETINST-1.iso
^^ Last modofied: 2022-03-28 14:18

With a net-install image you should be able to setup and explore the
IA64 Debian cosmos.

Example #1: binutils packages

Checking available binutils package for Debian/unstable IA64 (version:
2.39.90.20230110-1):

https://packages.debian.org/sid/binutils <--- Clearly states IA64 as
"unofficial port"
https://packages.debian.org/sid/ia64/binutils/filelist

Example #2: linux-image packages

Cannot say what this means...

https://packages.debian.org/search?arch=amd64&keywords=linux-image
(AMD64 - matches)

https://packages.debian.org/search?arch=ia64&keywords=linux-image
(IA64 - no matches)

https://packages.debian.org/search?arch=ia64&keywords=linux (IA64 -
matches - but no linux-image which ships normally a bootable
Linux-kernel)

As stated I have no expertise in Debian whatever release for IA64 arch.

Hope that helps.

-Sedat-

Sedat Dilek Jan. 14, 2023, 11:28 a.m. UTC | #19

[ ... ]

> Best is to ask the Debian release-team or (if there exist) maintainers
> or responsibles for the IA64 port - which is an ***unofficial*** port.
>

Here we go:

https://lists.debian.org/debian-ia64/

Posting address: debian-ia64@lists.debian.org

Found via <https://lists.debian.org/completeindex.html>

-Sedat-

Matthew Wilcox Jan. 15, 2023, 12:27 a.m. UTC | #20

On Sat, Jan 14, 2023 at 12:28:30PM +0100, Sedat Dilek wrote:
> [ ... ]
> 
> > Best is to ask the Debian release-team or (if there exist) maintainers
> > or responsibles for the IA64 port - which is an ***unofficial*** port.
> >
> 
> Here we go:
> 
> https://lists.debian.org/debian-ia64/
> 
> Posting address: debian-ia64@lists.debian.org
> 
> Found via <https://lists.debian.org/completeindex.html>

More useful perhaps is to look at https://popcon.debian.org/

There are three machines reporting popcon results.  It's dead.

Sedat Dilek Jan. 15, 2023, 12:04 p.m. UTC | #21

On Sun, Jan 15, 2023 at 1:27 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Jan 14, 2023 at 12:28:30PM +0100, Sedat Dilek wrote:
> > [ ... ]
> >
> > > Best is to ask the Debian release-team or (if there exist) maintainers
> > > or responsibles for the IA64 port - which is an ***unofficial*** port.
> > >
> >
> > Here we go:
> >
> > https://lists.debian.org/debian-ia64/
> >
> > Posting address: debian-ia64@lists.debian.org
> >
> > Found via <https://lists.debian.org/completeindex.html>
>
> More useful perhaps is to look at https://popcon.debian.org/
>
> There are three machines reporting popcon results.  It's dead.

Exactly, Debian Popularity Contest was what I was looking for yesterday.

Thanks Matthew.

[1] says in Inst (204701):

Name                              || Number  || %
==================================
binutils-x86-64-linux-gnu || 101548  || 49.61%
binutils-ia64-linux-gnu     ||          11  || 0.01%

HELP: Inst. is the number of people who installed this package (sum of
the four categories below)

There may be more popular packages than binutils.
( binutils might tell something about development happening or not. )

Anyway, I am not a popcon expert and never participated in Debian's
Popularity Contest.

-Sedat-

[1] https://qa.debian.org/popcon.php?package=binutils

John Paul Adrian Glaubitz Jan. 16, 2023, 9:32 a.m. UTC | #22

Hi Ard!

On 1/14/23 00:25, Ard Biesheuvel wrote:
> Thanks for reporting back. I (mis)read the debian ports page [3],
> which mentions Debian 7 as the highest Debian version that supports
> IA64, and so I assumed that support had been dropped from Debian.

This page talks about officially supported ports. Debian Ports is an
unofficial spin maintained by a number of Debian Developers and external
developers that are volunteering to maintain these ports.

> However, if only a handful of people want to keep this port alive for
> reasons of nostalgia, it is obviously obsolete, and we should ask
> ourselves whether it is reasonable to expect Linux contributors to
> keep spending time on this.

You could say this about a lot of hardware, can't you?

> Does the Debian ia64 port have any users? Or is the system that builds
> the packages the only one that consumes them?

There is the popcon statistics. However, that is opt-on and the numbers are
not really trustworthy. We are getting feedback from time to time from people
using it.

Is there any problem with the ia64 port at the moment that would justify removal?

Adrian

John Paul Adrian Glaubitz Jan. 16, 2023, 9:37 a.m. UTC | #23

On 1/14/23 12:24, Sedat Dilek wrote:
> Example #1: binutils packages
> 
> Checking available binutils package for Debian/unstable IA64 (version:
> 2.39.90.20230110-1):
> 
> https://packages.debian.org/sid/binutils <--- Clearly states IA64 as
> "unofficial port"

And?

> https://packages.debian.org/sid/ia64/binutils/filelist
> 
> Example #2: linux-image packages
> 
> Cannot say what this means...
> 
> https://packages.debian.org/search?arch=amd64&keywords=linux-image
> (AMD64 - matches)
> 
> https://packages.debian.org/search?arch=ia64&keywords=linux-image
> (IA64 - no matches)
> 
> https://packages.debian.org/search?arch=ia64&keywords=linux (IA64 -
> matches - but no linux-image which ships normally a bootable
> Linux-kernel)

No, the package is called "linux-image-$ARCH-$FLAVOR". There is no "linux-image"
package for amd64 either. Does that prove amd64 is dead?

What you are looking for can be found here:

> http://ftp.ports.debian.org/debian-ports/pool-ia64/main/l/linux/

> As stated I have no expertise in Debian whatever release for IA64 arch.

Well, maybe let me answer the questions then since I am maintaining the port
in Debian.

Adrian

John Paul Adrian Glaubitz Jan. 16, 2023, 9:40 a.m. UTC | #24

On 1/14/23 12:28, Sedat Dilek wrote:
> [ ... ]
> 
>> Best is to ask the Debian release-team or (if there exist) maintainers
>> or responsibles for the IA64 port - which is an ***unofficial*** port.
>>
> 
> Here we go:
> 
> https://lists.debian.org/debian-ia64/
> 
> Posting address: debian-ia64@lists.debian.org
> 
> Found via <https://lists.debian.org/completeindex.html>

And the wiki lists Jason Duerstock, Jessica Clarke and me as maintainers:

> https://wiki.debian.org/Ports/ia64

Adrian

John Paul Adrian Glaubitz Jan. 16, 2023, 9:41 a.m. UTC | #25

On 1/15/23 01:27, Matthew Wilcox wrote:
> More useful perhaps is to look at https://popcon.debian.org/
> 
> There are three machines reporting popcon results.  It's dead.

It's an opt-in mechanism that reports 190,000 machines running Debian
on x86_64. Do you think that there are only 190,000 servers world-wide
running Debian?

Adrian

John Paul Adrian Glaubitz Jan. 16, 2023, 9:42 a.m. UTC | #26

On 1/15/23 13:04, Sedat Dilek wrote:
> Exactly, Debian Popularity Contest was what I was looking for yesterday.
> 
> Thanks Matthew.
> 
> [1] says in Inst (204701):
> 
> Name                              || Number  || %
> ==================================
> binutils-x86-64-linux-gnu || 101548  || 49.61%
> binutils-ia64-linux-gnu     ||          11  || 0.01%
> 
> HELP: Inst. is the number of people who installed this package (sum of
> the four categories below)
> 
> There may be more popular packages than binutils.
> ( binutils might tell something about development happening or not. )
> 
> Anyway, I am not a popcon expert and never participated in Debian's
> Popularity Contest.

And that's the point, it's opt-in!

Adrian

Ard Biesheuvel Jan. 16, 2023, 10:09 a.m. UTC | #27

On Mon, 16 Jan 2023 at 10:33, John Paul Adrian Glaubitz
<glaubitz@physik.fu-berlin.de> wrote:
>
> Hi Ard!
>
> On 1/14/23 00:25, Ard Biesheuvel wrote:
> > Thanks for reporting back. I (mis)read the debian ports page [3],
> > which mentions Debian 7 as the highest Debian version that supports
> > IA64, and so I assumed that support had been dropped from Debian.
>
> This page talks about officially supported ports. Debian Ports is an
> unofficial spin maintained by a number of Debian Developers and external
> developers that are volunteering to maintain these ports.
>
> > However, if only a handful of people want to keep this port alive for
> > reasons of nostalgia, it is obviously obsolete, and we should ask
> > ourselves whether it is reasonable to expect Linux contributors to
> > keep spending time on this.
>
> You could say this about a lot of hardware, can't you?
>

Uhm, yes. Linux contributor effort is a scarce resource, and spending
it on architectures that nobody actually uses, such as alpha or ia64,
means it is not spent on things that are useful to more people.

I really do sympathize with the enthusiast/hobbyist PoV - I am also an
engineer that likes to tinker. So 'use' can be defined liberally here,
and cover running the latest Linux on ancient hardware just for
entertainment.

However, the question is not how you or I choose to spend (or waste)
their time. The question is whether it is reasonable *as a community*
to insist that everyone who contributes a cross-architecture change
also has to ensure that obsolete architectures such as i64 or alpha
are not left behind.

The original thread is an interesting example here - removing a
cpu_relax() in cmpxchg() that was only there because of IA64's clunky
SMT implementation. Perhaps this means that IA64 performance is going
to regress substantially for some workloads? Should anyone care?
Should we test such changes first? And how should we do that if there
is no maintainer and nobody has access to the hardware?

The other example is EFI, which i maintain. Should I require from
contributors that they build and boot test EFI changes on ia64 if I
myself don't even have access to the hardware? It is good to know that
things don't seem to be broken today, but if it is going to fall over,
it may take a while before anybody notices. What happens then?

> > Does the Debian ia64 port have any users? Or is the system that builds
> > the packages the only one that consumes them?
>
> There is the popcon statistics. However, that is opt-on and the numbers are
> not really trustworthy. We are getting feedback from time to time from people
> using it.
>
> Is there any problem with the ia64 port at the moment that would justify removal?
>

I would argue that we should mark it obsolete at the very least, so
that it is crystal clear that regressing IA64 (either knowingly or
unknowingly) by a generic or cross-architecture change is not a
showstopper, even at build time. Then, if someone has the skill set
and the time on their hands, as well as access to actual hardware,
they can keep it alive if they want to.

Matthew Wilcox Jan. 16, 2023, 1:28 p.m. UTC | #28

On Mon, Jan 16, 2023 at 10:41:14AM +0100, John Paul Adrian Glaubitz wrote:
> On 1/15/23 01:27, Matthew Wilcox wrote:
> > More useful perhaps is to look at https://popcon.debian.org/
> > 
> > There are three machines reporting popcon results.  It's dead.
> 
> It's an opt-in mechanism that reports 190,000 machines running Debian
> on x86_64. Do you think that there are only 190,000 servers world-wide
> running Debian?

No.  I think there are so few ia64 machines still in operation that the
effort required to continue to support them is now too high relative
to the benefits.  We've dropped support for hardware that still exists
before, and we'll do it again.  The only question is when.

I still have two ia64 machines in my basement.  I've turned one of them
on once since 2009.  And that was because I had a bug I needed to track
down and fix (2018, commits 4b664e739f77 and 2879b65f9de8).

David Howells Jan. 16, 2023, 2:08 p.m. UTC | #29

Hi Linus,

I'm not sure how relevant it is to the topic, but I seem to remember you
having a go at implementing spinlocks with x86_64 memory transaction
instructions a while back.  Do you have any thoughts on whether these
instructions are ever likely to become something we can use?

I was looking specifically at the skbuff queue stuff which does {
spin_lock_irq, add to list, inc count, spin_unlock_irq } and thinking that
might be a good place to use such a thing.

David

Matthew Wilcox Jan. 16, 2023, 3:09 p.m. UTC | #30

On Mon, Jan 16, 2023 at 02:08:15PM +0000, David Howells wrote:
> Hi Linus,
> 
> I'm not sure how relevant it is to the topic, but I seem to remember you
> having a go at implementing spinlocks with x86_64 memory transaction
> instructions a while back.  Do you have any thoughts on whether these
> instructions are ever likely to become something we can use?

Ever is a long time, but not while they're still buggy:

https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions

and not while they're not actually available on a vast majority of
x86 hardware.  ie AMD needs to implement them, make them available as
standard, forcing Intel to enable them globally instead of restricting
them to those who pay the $2.50/month subscription fee.

Linus Torvalds Jan. 16, 2023, 4:59 p.m. UTC | #31

On Mon, Jan 16, 2023 at 6:08 AM David Howells <dhowells@redhat.com> wrote:
>
> I'm not sure how relevant it is to the topic, but I seem to remember you
> having a go at implementing spinlocks with x86_64 memory transaction
> instructions a while back.  Do you have any thoughts on whether these
> instructions are ever likely to become something we can use?

Nope.

Not only are they buggy, the actual cost of them was prohibitive too.

The only implementation I ever did was what then became 'lockref' (so
this was about a decade ago), and using transactional memory was
actually quite noticeably *slower* than just using a spinlock in the
common cases (particularly the uncontended case - which is actually
the common one by far, despite some benchmarks).

And while the argument could be that transactional memory has improved
in the last decade (narrator: "It hasn't"), the problem is actually
quite fundamental.

The whole "best case scenario" for transactional memory concept is
literally optimized and designed for the rare circumstance - the case
where you have contention, but where the locking part is basically an
idempotent operation (so "lock+unlock" ends up undoing each other and
can use shared cachelines rather than bouncing the cacheline).

And contention pretty much happens in one situation, and one situation
only: in a combination of (a) bad locking and (b) benchmarks.

And for the kernel, where we don't have bad locking, and where we
actually use fine-grained locks that are _near_ the data that we are
locking (the lockref of the dcache is obviously one example of that,
but the skbuff queue you mention is almost certainly exactly the same
situation): the lock is right by the data that the lock protects, and
the "shared lock cacheline" model simply does not work. You'll bounce
the data, and most likely you'll also touch the same lock cacheline
too.

I was quite excited about transactional memory, but have (obviously)
come to the conclusion that it was one of those "looks good in theory,
but is just garbage in reality".

So this isn't an "the x86 implementation was bad" issue. This is a
"transactional memory is a bad idea" situation. It complicates the
most important part of the core CPU (the memory pipeline) enormously,
and it doesn't actually really work.

Now, with that "transactional memory is garbage" in mind, let me just
mention that there are obviously other circumstances:

 (a) horrendously bad locking

If you have a single centralized lock, and you really have serious
contention on it in real life all the time (and not just under
microbenchmarks), and the data that gets touched is spread out all
over, and you simply cannot fix the locking, then all those
theoretical advantages of transactional memory can actually come into
play.

But again: even then, there's an absolutely huge cost to the memory
pipeline.  So that advantage really doesn't come free. Nobody has ever
gotten transactional memory to actually work well in real life. Not
Intel, not IBM with powerpc HTM.

That should tell you something. The hw complexity cost is very very real.

But Intel actually had some loads where TSX helped. I think it was SAP
HANA, and I really think that what you should take away from that is
that SAP HANA locking is likely horrible garbage inside. But maybe I'm
biased, and maybe SAP HANA is lovely, and maybe it just happened to
work really well for TSX. I've never actually looked at that load, but
if I were a betting man...

 (b) very *limited* transactions are probably a great idea

LL/SC is arguably a form of "transactional memory", just with a single
word. Now, I think atomics are generally usually better than LL/SC
when it really is just a single word (because most of the time,
"single word" also means "simple semantics", and while CMPXCHG loops
aren't lovely when the semantics are a bit more complex, it's actually
nice and portable and works at a higher level than assembler sequences
and as such is actually a *lot* better than LL/SC in practice).

But a "N-word LL/SC with forward progress guarantees" would be
objectively stronger than atomics, and would allow for doing low-level
data structures in ways that atomics don't. Doubly linked lists
together with a lock check would argue for "N" being ~5 memory
operations. So I do believe in potentially *limited* memory
transactions, when they can be made limited enough that you have

 (1) forward progress guarantees with no "separate fallback on
contention" garbage

 (2) can keep the hw implementation simple enough with just a store
buffer and an (architecturally) very limited number of cacheline
accesses that you can actually order in HW so that you don't have ABBA
situations.

But the generic kind of transactional memory that Intel tried with TSX
(and IBM with HTM) is pure garbage, and almost certainly always will
be.

And all of the above is obviously just my opinionated input on it. But
I just happen to be right.

                Linus

David Howells Jan. 18, 2023, 9:05 a.m. UTC | #32

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And for the kernel, where we don't have bad locking, and where we
> actually use fine-grained locks that are _near_ the data that we are
> locking (the lockref of the dcache is obviously one example of that,
> but the skbuff queue you mention is almost certainly exactly the same
> situation): the lock is right by the data that the lock protects, and
> the "shared lock cacheline" model simply does not work. You'll bounce
> the data, and most likely you'll also touch the same lock cacheline
> too.

Yeah.  The reason I was actually wondering about them was if it would be
possible to avoid the requirement to disable interrupts/softirqs to, say,
modify the skbuff queue.  On some arches actually disabling irqs is quite a
heavy operation (I think this is/was true on ppc64, for example; it certainly
was on frv) and it was necessary to "emulate" the disablement.

David

Nicholas Piggin Jan. 19, 2023, 1:41 a.m. UTC | #33

On Wed Jan 18, 2023 at 7:05 PM AEST, David Howells wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > And for the kernel, where we don't have bad locking, and where we
> > actually use fine-grained locks that are _near_ the data that we are
> > locking (the lockref of the dcache is obviously one example of that,
> > but the skbuff queue you mention is almost certainly exactly the same
> > situation): the lock is right by the data that the lock protects, and
> > the "shared lock cacheline" model simply does not work. You'll bounce
> > the data, and most likely you'll also touch the same lock cacheline
> > too.
>
> Yeah.  The reason I was actually wondering about them was if it would be
> possible to avoid the requirement to disable interrupts/softirqs to, say,
> modify the skbuff queue.  On some arches actually disabling irqs is quite a
> heavy operation (I think this is/was true on ppc64, for example; it certainly
> was on frv) and it was necessary to "emulate" the disablement.

Not too bad on modern ppc64. Changing MSR in general has to flush the
pipe and even re-fetch, because it can alter memory translation among
other things, so it was heavy. Everything we support has a lightweight
MSR change that just modifies the interrupt enable bit and only needs
minor serialisation (although we still have that software-irq-disable
thing which avoids the heavy MSR problem on old CPUs).

Thanks,
Nick

lockref scalability on x86-64 vs cpu_relax

Commit Message

Comments

Patch