Message ID | 20241207082931.1707465-1-mjguzik@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: remove an avoidable load of page refcount in page_ref_add_unless | expand |
On 07.12.24 09:29, Mateusz Guzik wrote: > Explicitly pre-checking the count adds nothing as atomic_add_unless > starts with doing the same thing. iow no functional changes. I recall that we added that check because with the hugetlb vmemmap optimization, some of the tail pages we don't ever expect to be modified (because they are fake-duplicated) might be mapped R/O. If the arch implementation of atomic_add_unless() would trigger an unconditional write fault, we'd be in trouble. That would likely only be the case if the arch provides a dedicate instruction. atomic_add_unless()->raw_atomic_add_unless() Nobody currently defines arch_atomic_add_unless(). raw_atomic_fetch_add_unless()->arch_atomic_fetch_add_unless() is defined on some architectures. I scanned some of the inline-asm, and I think most of them perform a check first. So this currently looks good to me, but we'll rely on the fact that atomic_add_unless() will never trigger a write fault if the values match. Which makes me wonder if we should document that behavior of atomic_add_unless().
On Mon, Dec 9, 2024 at 10:28 AM David Hildenbrand <david@redhat.com> wrote: > > On 07.12.24 09:29, Mateusz Guzik wrote: > > Explicitly pre-checking the count adds nothing as atomic_add_unless > > starts with doing the same thing. iow no functional changes. > > I recall that we added that check because with the hugetlb vmemmap > optimization, some of the tail pages we don't ever expect to be modified > (because they are fake-duplicated) might be mapped R/O. > > If the arch implementation of atomic_add_unless() would trigger an > unconditional write fault, we'd be in trouble. That would likely only be > the case if the arch provides a dedicate instruction. > > atomic_add_unless()->raw_atomic_add_unless() > > Nobody currently defines arch_atomic_add_unless(). > > raw_atomic_fetch_add_unless()->arch_atomic_fetch_add_unless() is defined > on some architectures. > > I scanned some of the inline-asm, and I think most of them perform a > check first. > Huh. Some arch triggering a write fault despite not changing the value is not something I thought about. Sounds pretty broken to me if any arch was to do it, but then stranger things did happen. However, if this is seen as a real concern, then I think the best way forward is to drop the patch (and maybe instead add a comment what's up with the extra load).
On 09.12.24 11:25, Mateusz Guzik wrote: > On Mon, Dec 9, 2024 at 10:28 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 07.12.24 09:29, Mateusz Guzik wrote: >>> Explicitly pre-checking the count adds nothing as atomic_add_unless >>> starts with doing the same thing. iow no functional changes. >> >> I recall that we added that check because with the hugetlb vmemmap >> optimization, some of the tail pages we don't ever expect to be modified >> (because they are fake-duplicated) might be mapped R/O. >> >> If the arch implementation of atomic_add_unless() would trigger an >> unconditional write fault, we'd be in trouble. That would likely only be >> the case if the arch provides a dedicate instruction. >> >> atomic_add_unless()->raw_atomic_add_unless() >> >> Nobody currently defines arch_atomic_add_unless(). >> >> raw_atomic_fetch_add_unless()->arch_atomic_fetch_add_unless() is defined >> on some architectures. >> >> I scanned some of the inline-asm, and I think most of them perform a >> check first. >> > > Huh. > > Some arch triggering a write fault despite not changing the value is > not something I thought about. Sounds pretty broken to me if any arch > was to do it, but then stranger things did happen. Yeah, it really depends on what the architecture defines. For example, on s390x for "COMPARE AND SWAP" the spec states something like "When the result of the comparison is unequal, the second operand is loaded at the first-operand loca- tion, and the second-operand location remains unchanged. However, on some models, the contents may be fetched and subsequently stored back unchanged at the second-operand location. This update appears to be a block-concurrent interlocked- update reference as observed by other CPUs." So there might be an unconditional store on an instruction where one would not expect it. Something similar-but-different recently popped up on aarch64, which does what one would expect: "The atomic RMW instructions, for example, ldadd, actually does load + add + store in one instruction, it will trigger two page faults per the ARM64 architecture spec, the first fault is a read fault, the second fault is a write fault. Some applications use atomic RMW instructions to populate memory, for example, openjdk uses atomic-add-0 to do pretouch (populate heap memory at launch time) between v18 and v22 in order to permit use of memory concurrently with pretouch." [1] And Christoph commented: "x86 does not do a read fault on atomics so we have an issue htere." [2] I did not check if that is actually true on x86. > > However, if this is seen as a real concern, then I think the best way > forward is to drop the patch (and maybe instead add a comment what's > up with the extra load). I assume we're currently fine because no architecture actually defines such an instruction that would be a problem for add_unless. [1] https://lore.kernel.org/linux-arm-kernel/Zmw1jltdkMrTrT_l@arm.com/T/ [2] https://lore.kernel.org/linux-arm-kernel/c1ba9ba3-b0d6-4c6c-d628-614751d737c2@gentwo.org/
On Mon, Dec 9, 2024 at 11:56 AM David Hildenbrand <david@redhat.com> wrote: > > On 09.12.24 11:25, Mateusz Guzik wrote: > > On Mon, Dec 9, 2024 at 10:28 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 07.12.24 09:29, Mateusz Guzik wrote: > >>> Explicitly pre-checking the count adds nothing as atomic_add_unless > >>> starts with doing the same thing. iow no functional changes. > >> > >> I recall that we added that check because with the hugetlb vmemmap > >> optimization, some of the tail pages we don't ever expect to be modified > >> (because they are fake-duplicated) might be mapped R/O. > >> > >> If the arch implementation of atomic_add_unless() would trigger an > >> unconditional write fault, we'd be in trouble. That would likely only be > >> the case if the arch provides a dedicate instruction. > >> > >> atomic_add_unless()->raw_atomic_add_unless() > >> > >> Nobody currently defines arch_atomic_add_unless(). > >> > >> raw_atomic_fetch_add_unless()->arch_atomic_fetch_add_unless() is defined > >> on some architectures. > >> > >> I scanned some of the inline-asm, and I think most of them perform a > >> check first. > >> > > > > Huh. > > > > Some arch triggering a write fault despite not changing the value is > > not something I thought about. Sounds pretty broken to me if any arch > > was to do it, but then stranger things did happen. > > Yeah, it really depends on what the architecture defines. For example, > on s390x for "COMPARE AND SWAP" the spec states something like > [snip] Well in this context you need to do the initial load to even know what to CAS with, unless you want to blindly do it hoping to get lucky, which I'm assuming no arch is doing. Granted, if there was an architecture which had an actual "cas unless the value is x" then this would not hold, but I don't know of any. [such an extension would be most welcome fwiw] Assuming you indeed want the patch after all, can you sort out adding a comment to atomic_add_unless yourself? ;) I presume you know the right people and whatnot, so this would cut down on back and forth. That is to say I think this thread just about exhausted the time warranted by this patch. No hard feelz if it gets dropped, but then I do strongly suggest adding a justification to the extra load. Cheers,
On 09.12.24 13:33, Mateusz Guzik wrote: > On Mon, Dec 9, 2024 at 11:56 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 09.12.24 11:25, Mateusz Guzik wrote: >>> On Mon, Dec 9, 2024 at 10:28 AM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 07.12.24 09:29, Mateusz Guzik wrote: >>>>> Explicitly pre-checking the count adds nothing as atomic_add_unless >>>>> starts with doing the same thing. iow no functional changes. >>>> >>>> I recall that we added that check because with the hugetlb vmemmap >>>> optimization, some of the tail pages we don't ever expect to be modified >>>> (because they are fake-duplicated) might be mapped R/O. >>>> >>>> If the arch implementation of atomic_add_unless() would trigger an >>>> unconditional write fault, we'd be in trouble. That would likely only be >>>> the case if the arch provides a dedicate instruction. >>>> >>>> atomic_add_unless()->raw_atomic_add_unless() >>>> >>>> Nobody currently defines arch_atomic_add_unless(). >>>> >>>> raw_atomic_fetch_add_unless()->arch_atomic_fetch_add_unless() is defined >>>> on some architectures. >>>> >>>> I scanned some of the inline-asm, and I think most of them perform a >>>> check first. >>>> >>> >>> Huh. >>> >>> Some arch triggering a write fault despite not changing the value is >>> not something I thought about. Sounds pretty broken to me if any arch >>> was to do it, but then stranger things did happen. >> >> Yeah, it really depends on what the architecture defines. For example, >> on s390x for "COMPARE AND SWAP" the spec states something like >> > [snip] > > Well in this context you need to do the initial load to even know what > to CAS with, unless you want to blindly do it hoping to get lucky, > which I'm assuming no arch is doing. > > Granted, if there was an architecture which had an actual "cas unless > the value is x" then this would not hold, but I don't know of any. > [such an extension would be most welcome fwiw] Apparently, we prepared for that via arch_atomic_add_unless(), which has no users. > > Assuming you indeed want the patch after all, can you sort out adding > a comment to atomic_add_unless yourself? ;) I presume you know the > right people and whatnot, so this would cut down on back and forth. > > That is to say I think this thread just about exhausted the time > warranted by this patch. No hard feelz if it gets dropped, but then I > do strongly suggest adding a justification to the extra load. Maybe it's sufficient for now to simply do your change with a comment: diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h index 8c236c651d1d6..1efc992ad5687 100644 --- a/include/linux/page_ref.h +++ b/include/linux/page_ref.h @@ -234,7 +234,13 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u) rcu_read_lock(); /* avoid writing to the vmemmap area being remapped */ - if (!page_is_fake_head(page) && page_ref_count(page) != u) + if (!page_is_fake_head(page)) + /* + * atomic_add_unless() will currently never modify the value + * if it already is u. If that ever changes, we'd have to have + * a separate check here, such that we won't be writing to + * write-protected vmemmap areas. + */ ret = atomic_add_unless(&page->_refcount, nr, u); rcu_read_unlock(); It would bail out during testing ... hopefully, such that we can detect any such change.
On Mon, Dec 9, 2024 at 3:22 PM David Hildenbrand <david@redhat.com> wrote: > > On 09.12.24 13:33, Mateusz Guzik wrote: > > That is to say I think this thread just about exhausted the time > > warranted by this patch. No hard feelz if it gets dropped, but then I > > do strongly suggest adding a justification to the extra load. > > Maybe it's sufficient for now to simply do your change with a comment: > > diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h > index 8c236c651d1d6..1efc992ad5687 100644 > --- a/include/linux/page_ref.h > +++ b/include/linux/page_ref.h > @@ -234,7 +234,13 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u) > > rcu_read_lock(); > /* avoid writing to the vmemmap area being remapped */ > - if (!page_is_fake_head(page) && page_ref_count(page) != u) > + if (!page_is_fake_head(page)) > + /* > + * atomic_add_unless() will currently never modify the value > + * if it already is u. If that ever changes, we'd have to have > + * a separate check here, such that we won't be writing to > + * write-protected vmemmap areas. > + */ > ret = atomic_add_unless(&page->_refcount, nr, u); > rcu_read_unlock(); > > > It would bail out during testing ... hopefully, such that we can detect any such change. > Not my call to make, but looks good. ;) fwiw I don't need any credit and I would be more than happy if you just submitted the thing as your own without me being mentioned. *No* cc would also be appreciated.
On 09.12.24 15:30, Mateusz Guzik wrote: > On Mon, Dec 9, 2024 at 3:22 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 09.12.24 13:33, Mateusz Guzik wrote: >>> That is to say I think this thread just about exhausted the time >>> warranted by this patch. No hard feelz if it gets dropped, but then I >>> do strongly suggest adding a justification to the extra load. >> >> Maybe it's sufficient for now to simply do your change with a comment: >> >> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h >> index 8c236c651d1d6..1efc992ad5687 100644 >> --- a/include/linux/page_ref.h >> +++ b/include/linux/page_ref.h >> @@ -234,7 +234,13 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u) >> >> rcu_read_lock(); >> /* avoid writing to the vmemmap area being remapped */ >> - if (!page_is_fake_head(page) && page_ref_count(page) != u) >> + if (!page_is_fake_head(page)) >> + /* >> + * atomic_add_unless() will currently never modify the value >> + * if it already is u. If that ever changes, we'd have to have >> + * a separate check here, such that we won't be writing to >> + * write-protected vmemmap areas. >> + */ >> ret = atomic_add_unless(&page->_refcount, nr, u); >> rcu_read_unlock(); >> >> >> It would bail out during testing ... hopefully, such that we can detect any such change. >> > > Not my call to make, but looks good. ;) > > fwiw I don't need any credit and I would be more than happy if you > just submitted the thing as your own without me being mentioned. *No* > cc would also be appreciated. Likely Andrew can add the comment as a fixup.
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h index 8c236c651d1d..fa203894876f 100644 --- a/include/linux/page_ref.h +++ b/include/linux/page_ref.h @@ -234,7 +234,7 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u) rcu_read_lock(); /* avoid writing to the vmemmap area being remapped */ - if (!page_is_fake_head(page) && page_ref_count(page) != u) + if (!page_is_fake_head(page)) ret = atomic_add_unless(&page->_refcount, nr, u); rcu_read_unlock();
Explicitly pre-checking the count adds nothing as atomic_add_unless starts with doing the same thing. iow no functional changes. disasm of stock filemap_get_read_batch from perf top while running readseek2_processes -t 24: 0.04 │ cb: mov 0x34(%rbx),%eax # first load 73.11 │ test %eax,%eax │ ↓ je 1bd 0.09 │ mov 0x34(%rbx),%eax # second load 1.01 │ d9: test %eax,%eax │ ↓ je 1bd 0.06 │ lea 0x1(%rax),%edx 0.00 │ lea 0x34(%rbx),%r14 0.00 │ lock cmpxchg %edx,0x34(%rbx) 14.06 │ ↑ jne d9 Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> --- I did not bother benchmarking, I don't think there is anything warranting it for this one. fwiw it plausibly is worth few % in a microbenchmark at higher core count. include/linux/page_ref.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)