diff mbox series

[next,3/3] shmem: Fix "Unused swap" messages

Message ID 49ae72d6-f5f-5cd-e480-e2212cb7af97@google.com (mailing list archive)
State New, archived
Headers show
Series [next,1/3] truncate,shmem: Fix data loss when hole punched in folio | expand

Commit Message

Hugh Dickins Jan. 3, 2022, 1:35 a.m. UTC
shmem_swapin_page()'s swap_free() has occasionally been generating
"_swap_info_get: Unused swap offset entry" messages.  Usually that's
no worse than noise; but perhaps it indicates a worse case, when we
might there be freeing swap already reused by others.

The multi-index xas_find_conflict() loop in shmem_add_to_page_cache()
did not allow for entry found NULL when expected to be non-NULL, so did
not catch that race when the swap has already been freed.

The loop would not actually catch a realistic conflict which the single
check does not catch, so revert it back to the single check.

Fixes: 3103f9a51dd0 ("mm: Use multi-index entries in the page cache")
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/shmem.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Matthew Wilcox Jan. 3, 2022, 3:36 p.m. UTC | #1
On Sun, Jan 02, 2022 at 05:35:50PM -0800, Hugh Dickins wrote:
> shmem_swapin_page()'s swap_free() has occasionally been generating
> "_swap_info_get: Unused swap offset entry" messages.  Usually that's
> no worse than noise; but perhaps it indicates a worse case, when we
> might there be freeing swap already reused by others.
> 
> The multi-index xas_find_conflict() loop in shmem_add_to_page_cache()
> did not allow for entry found NULL when expected to be non-NULL, so did
> not catch that race when the swap has already been freed.
> 
> The loop would not actually catch a realistic conflict which the single
> check does not catch, so revert it back to the single check.

I think what led to the loop was concern for the xa_state if trying
to find a swap entry that's smaller than the size of the folio.
So yes, the loop was expected to execute twice, but I didn't consider
the case where we were looking for something non-NULL and actually found
NULL.

So should we actually call xas_find_conflict() twice (if we're looking
for something non-NULL), and check that we get @expected, followed by
NULL?
Hugh Dickins Jan. 3, 2022, 8:10 p.m. UTC | #2
On Mon, 3 Jan 2022, Matthew Wilcox wrote:
> On Sun, Jan 02, 2022 at 05:35:50PM -0800, Hugh Dickins wrote:
> > shmem_swapin_page()'s swap_free() has occasionally been generating
> > "_swap_info_get: Unused swap offset entry" messages.  Usually that's
> > no worse than noise; but perhaps it indicates a worse case, when we
> > might there be freeing swap already reused by others.
> > 
> > The multi-index xas_find_conflict() loop in shmem_add_to_page_cache()
> > did not allow for entry found NULL when expected to be non-NULL, so did
> > not catch that race when the swap has already been freed.
> > 
> > The loop would not actually catch a realistic conflict which the single
> > check does not catch, so revert it back to the single check.
> 
> I think what led to the loop was concern for the xa_state if trying
> to find a swap entry that's smaller than the size of the folio.
> So yes, the loop was expected to execute twice, but I didn't consider
> the case where we were looking for something non-NULL and actually found
> NULL.
> 
> So should we actually call xas_find_conflict() twice (if we're looking
> for something non-NULL), and check that we get @expected, followed by
> NULL?

Sorry, I've no idea.

You say "twice", and that does not fit the imaginary model I had when I
said "The loop would not actually catch a realistic conflict which the
single check does not catch".

I was imagining it either looking at a single entry, or looking at an
array of (perhaps sometimes in shmem's case 512) entries, looking for
conflict with the supplied pointer/value expected there.

The loop technique was already unable to report on unexpected NULLs,
and the single test would catch a non-NULL entry different from an
expected non-NULL entry.  Its only relative weakness appeared to be
if that array contained (perhaps some NULLs then) a "narrow" instance
of the same pointer/value that was expected to fill the array; and I
didn't see any possibility for shmem to be inserting small and large
folios sharing the same address at the same time.

That "explanation" may make no sense to you, don't worry about it;
just as "twice" makes no immediate sense to me - I'd have to go off
and study multi-index XArray to make sense of it, which I'm not
about to do.

I've seen no problems with the proposed patch, but if you see a real
case that it's failing to cover, yes, please do improve it of course.

Though now I'm wondering if the "loop" totally misled me; and your
"twice" just means that we need to test first this and then that and
we're done - yeah, maybe.

Hugh
Matthew Wilcox Jan. 4, 2022, 1:41 a.m. UTC | #3
On Mon, Jan 03, 2022 at 12:10:21PM -0800, Hugh Dickins wrote:
> On Mon, 3 Jan 2022, Matthew Wilcox wrote:
> > On Sun, Jan 02, 2022 at 05:35:50PM -0800, Hugh Dickins wrote:
> > > shmem_swapin_page()'s swap_free() has occasionally been generating
> > > "_swap_info_get: Unused swap offset entry" messages.  Usually that's
> > > no worse than noise; but perhaps it indicates a worse case, when we
> > > might there be freeing swap already reused by others.
> > > 
> > > The multi-index xas_find_conflict() loop in shmem_add_to_page_cache()
> > > did not allow for entry found NULL when expected to be non-NULL, so did
> > > not catch that race when the swap has already been freed.
> > > 
> > > The loop would not actually catch a realistic conflict which the single
> > > check does not catch, so revert it back to the single check.
> > 
> > I think what led to the loop was concern for the xa_state if trying
> > to find a swap entry that's smaller than the size of the folio.
> > So yes, the loop was expected to execute twice, but I didn't consider
> > the case where we were looking for something non-NULL and actually found
> > NULL.
> > 
> > So should we actually call xas_find_conflict() twice (if we're looking
> > for something non-NULL), and check that we get @expected, followed by
> > NULL?
> 
> Sorry, I've no idea.
> 
> You say "twice", and that does not fit the imaginary model I had when I
> said "The loop would not actually catch a realistic conflict which the
> single check does not catch".
> 
> I was imagining it either looking at a single entry, or looking at an
> array of (perhaps sometimes in shmem's case 512) entries, looking for
> conflict with the supplied pointer/value expected there.
> 
> The loop technique was already unable to report on unexpected NULLs,
> and the single test would catch a non-NULL entry different from an
> expected non-NULL entry.  Its only relative weakness appeared to be
> if that array contained (perhaps some NULLs then) a "narrow" instance
> of the same pointer/value that was expected to fill the array; and I
> didn't see any possibility for shmem to be inserting small and large
> folios sharing the same address at the same time.
> 
> That "explanation" may make no sense to you, don't worry about it;
> just as "twice" makes no immediate sense to me - I'd have to go off
> and study multi-index XArray to make sense of it, which I'm not
> about to do.
> 
> I've seen no problems with the proposed patch, but if you see a real
> case that it's failing to cover, yes, please do improve it of course.
> 
> Though now I'm wondering if the "loop" totally misled me; and your
> "twice" just means that we need to test first this and then that and
> we're done - yeah, maybe.

Sorry; I wrote the above in a hurry and early in the morning, so probably
even less coherent than usual.  Also, the multi-index xarray code is
new to everyone (and adds new things to consider), so it's no surprise
we're talking past each other.  It's a bit strange for me to read your
explanations because you're only reading what I actually wrote instead
of what I intended to write.

So let me try again.  My concern was that we might be trying to store
a 2MB entry which had a non-NULL 'expected' entry which was found in a
4k (ie single-index) slot within the 512 entries (as the first non-NULL
entry in that range), and we'd then store the 2MB entry into a
single-entry slot.

Now, maybe that can't happen for higher-level reasons, and I don't need
to worry about it.  But I feel like we should check for that?  Anyway,
I think the right fix is this:

+++ b/mm/shmem.c
@@ -733,11 +733,12 @@ static int shmem_add_to_page_cache(struct page *page,
        cgroup_throttle_swaprate(page, gfp);

        do {
-               void *entry;
                xas_lock_irq(&xas);
-               while ((entry = xas_find_conflict(&xas)) != NULL) {
-                       if (entry == expected)
-                               continue;
+               if (expected != xas_find_conflict(&xas)) {
+                       xas_set_err(&xas, -EEXIST);
+                       goto unlock;
+               }
+               if (expected && xas_find_conflict(&xas)) {
                        xas_set_err(&xas, -EEXIST);
                        goto unlock;
                }

which says what I mean.  I certainly didn't intend to imply that I
was expecting to see 512 consecutive entries which were all identical,
which would be the idiomatic way to read the code that was there before.
I shouldn't've tried to be so concise.

(If you'd rather I write any of this differently, I'm more than happy
to change it)
Hugh Dickins Jan. 4, 2022, 4:43 a.m. UTC | #4
On Tue, 4 Jan 2022, Matthew Wilcox wrote:
> On Mon, Jan 03, 2022 at 12:10:21PM -0800, Hugh Dickins wrote:
> > On Mon, 3 Jan 2022, Matthew Wilcox wrote:
> > > On Sun, Jan 02, 2022 at 05:35:50PM -0800, Hugh Dickins wrote:
> > > > shmem_swapin_page()'s swap_free() has occasionally been generating
> > > > "_swap_info_get: Unused swap offset entry" messages.  Usually that's
> > > > no worse than noise; but perhaps it indicates a worse case, when we
> > > > might there be freeing swap already reused by others.
> > > > 
> > > > The multi-index xas_find_conflict() loop in shmem_add_to_page_cache()
> > > > did not allow for entry found NULL when expected to be non-NULL, so did
> > > > not catch that race when the swap has already been freed.
> > > > 
> > > > The loop would not actually catch a realistic conflict which the single
> > > > check does not catch, so revert it back to the single check.
> > > 
> > > I think what led to the loop was concern for the xa_state if trying
> > > to find a swap entry that's smaller than the size of the folio.
> > > So yes, the loop was expected to execute twice, but I didn't consider
> > > the case where we were looking for something non-NULL and actually found
> > > NULL.
> > > 
> > > So should we actually call xas_find_conflict() twice (if we're looking
> > > for something non-NULL), and check that we get @expected, followed by
> > > NULL?
> > 
> > Sorry, I've no idea.
> > 
> > You say "twice", and that does not fit the imaginary model I had when I
> > said "The loop would not actually catch a realistic conflict which the
> > single check does not catch".
> > 
> > I was imagining it either looking at a single entry, or looking at an
> > array of (perhaps sometimes in shmem's case 512) entries, looking for
> > conflict with the supplied pointer/value expected there.
> > 
> > The loop technique was already unable to report on unexpected NULLs,
> > and the single test would catch a non-NULL entry different from an
> > expected non-NULL entry.  Its only relative weakness appeared to be
> > if that array contained (perhaps some NULLs then) a "narrow" instance
> > of the same pointer/value that was expected to fill the array; and I
> > didn't see any possibility for shmem to be inserting small and large
> > folios sharing the same address at the same time.
> > 
> > That "explanation" may make no sense to you, don't worry about it;
> > just as "twice" makes no immediate sense to me - I'd have to go off
> > and study multi-index XArray to make sense of it, which I'm not
> > about to do.
> > 
> > I've seen no problems with the proposed patch, but if you see a real
> > case that it's failing to cover, yes, please do improve it of course.
> > 
> > Though now I'm wondering if the "loop" totally misled me; and your
> > "twice" just means that we need to test first this and then that and
> > we're done - yeah, maybe.
> 
> Sorry; I wrote the above in a hurry and early in the morning, so probably
> even less coherent than usual.  Also, the multi-index xarray code is
> new to everyone (and adds new things to consider), so it's no surprise
> we're talking past each other.  It's a bit strange for me to read your
> explanations because you're only reading what I actually wrote instead
> of what I intended to write.
> 
> So let me try again.  My concern was that we might be trying to store
> a 2MB entry which had a non-NULL 'expected' entry which was found in a
> 4k (ie single-index) slot within the 512 entries (as the first non-NULL
> entry in that range), and we'd then store the 2MB entry into a
> single-entry slot.

Thanks, that sounds much more like how I was imagining it.  And the
two separate tests much more understandable than twice round a loop.

> 
> Now, maybe that can't happen for higher-level reasons, and I don't need
> to worry about it.  But I feel like we should check for that?  Anyway,
> I think the right fix is this:

I don't object to (cheaply) excluding a possibility at the low level,
even if there happen to be high level reasons why it cannot happen at
present.

But I don't think your second xas_find_conflict() gives quite as much
assurance as you're expecting of it.  Since xas_find_conflict() skips
NULLs, the conflict test would pass if there were 511 NULLs and one
4k entry matching the expected entry, but a 2MB entry to be inserted
(the "small and large folios" case in my earlier ramblings).

I think xas_find_conflict() is not really up to giving that assurance;
and maybe better than a second call to xas_find_conflict(), might be
a VM_BUG_ON earlier, to say that if 'expected' is non-NULL, then the
range is PAGE_SIZE - or something more folio-friendly like that?
That would give you the extra assurance you're looking for,
wouldn't it?

For now and the known future, shmem only swaps PAGE_SIZE, out and in;
maybe someone will want to change that one day, then xas_find_conflict()
could be enhanced to know more of what's expected.

> 
> +++ b/mm/shmem.c
> @@ -733,11 +733,12 @@ static int shmem_add_to_page_cache(struct page *page,
>         cgroup_throttle_swaprate(page, gfp);
> 
>         do {
> -               void *entry;
>                 xas_lock_irq(&xas);
> -               while ((entry = xas_find_conflict(&xas)) != NULL) {
> -                       if (entry == expected)
> -                               continue;
> +               if (expected != xas_find_conflict(&xas)) {
> +                       xas_set_err(&xas, -EEXIST);
> +                       goto unlock;
> +               }
> +               if (expected && xas_find_conflict(&xas)) {
>                         xas_set_err(&xas, -EEXIST);
>                         goto unlock;
>                 }

That also worried me because, if the second xas_find_conflict()
is to make any sense, the first must have had a side-effect on xas:
are those side-effects okay for the subsequent xas_store(&xas, page)?
You'll know that they are, but it's not obvious to the reader.

> 
> which says what I mean.  I certainly didn't intend to imply that I
> was expecting to see 512 consecutive entries which were all identical,
> which would be the idiomatic way to read the code that was there before.
> I shouldn't've tried to be so concise.
> 
> (If you'd rather I write any of this differently, I'm more than happy
> to change it)

No, I'm happy with the style of it, just discontented that the second
xas_find_conflict() pretends to more than it provides (I think).

Hugh
Matthew Wilcox Jan. 4, 2022, 8:12 a.m. UTC | #5
On Mon, Jan 03, 2022 at 08:43:13PM -0800, Hugh Dickins wrote:
> On Tue, 4 Jan 2022, Matthew Wilcox wrote:
> > So let me try again.  My concern was that we might be trying to store
> > a 2MB entry which had a non-NULL 'expected' entry which was found in a
> > 4k (ie single-index) slot within the 512 entries (as the first non-NULL
> > entry in that range), and we'd then store the 2MB entry into a
> > single-entry slot.
> 
> Thanks, that sounds much more like how I was imagining it.  And the
> two separate tests much more understandable than twice round a loop.
> 
> > Now, maybe that can't happen for higher-level reasons, and I don't need
> > to worry about it.  But I feel like we should check for that?  Anyway,
> > I think the right fix is this:
> 
> I don't object to (cheaply) excluding a possibility at the low level,
> even if there happen to be high level reasons why it cannot happen at
> present.
> 
> But I don't think your second xas_find_conflict() gives quite as much
> assurance as you're expecting of it.  Since xas_find_conflict() skips
> NULLs, the conflict test would pass if there were 511 NULLs and one
> 4k entry matching the expected entry, but a 2MB entry to be inserted
> (the "small and large folios" case in my earlier ramblings).
> 
> I think xas_find_conflict() is not really up to giving that assurance;
> and maybe better than a second call to xas_find_conflict(), might be
> a VM_BUG_ON earlier, to say that if 'expected' is non-NULL, then the
> range is PAGE_SIZE - or something more folio-friendly like that?
> That would give you the extra assurance you're looking for,
> wouldn't it?

I actually don't need that assurance.  The wretch who wrote the
documentation for xas_find_conflict() needs to be put on a diet of
bread and water, but the promise that it should make is that once it
returns NULL, the xa_state is restored to where it was before the first
call to xas_find_conflict().  So if you're trying to store a 2MB entry
and the only swap entry in the 2MB range is a 4KB entry, the xa_state
gets walked back up to point at the original 512-aligned entry and the
subsequent xas_store() will free the node containing the swap entry.

> For now and the known future, shmem only swaps PAGE_SIZE, out and in;
> maybe someone will want to change that one day, then xas_find_conflict()
> could be enhanced to know more of what's expected.

Good to know.

> > 
> > +++ b/mm/shmem.c
> > @@ -733,11 +733,12 @@ static int shmem_add_to_page_cache(struct page *page,
> >         cgroup_throttle_swaprate(page, gfp);
> > 
> >         do {
> > -               void *entry;
> >                 xas_lock_irq(&xas);
> > -               while ((entry = xas_find_conflict(&xas)) != NULL) {
> > -                       if (entry == expected)
> > -                               continue;
> > +               if (expected != xas_find_conflict(&xas)) {
> > +                       xas_set_err(&xas, -EEXIST);
> > +                       goto unlock;
> > +               }
> > +               if (expected && xas_find_conflict(&xas)) {
> >                         xas_set_err(&xas, -EEXIST);
> >                         goto unlock;
> >                 }
> 
> That also worried me because, if the second xas_find_conflict()
> is to make any sense, the first must have had a side-effect on xas:
> are those side-effects okay for the subsequent xas_store(&xas, page)?
> You'll know that they are, but it's not obvious to the reader.
> 
> > 
> > which says what I mean.  I certainly didn't intend to imply that I
> > was expecting to see 512 consecutive entries which were all identical,
> > which would be the idiomatic way to read the code that was there before.
> > I shouldn't've tried to be so concise.
> > 
> > (If you'd rather I write any of this differently, I'm more than happy
> > to change it)
> 
> No, I'm happy with the style of it, just discontented that the second
> xas_find_conflict() pretends to more than it provides (I think).
> 
> Hugh
diff mbox series

Patch

--- hughd2/mm/shmem.c
+++ hughd3/mm/shmem.c
@@ -727,9 +727,8 @@  static int shmem_add_to_page_cache(struc
 	do {
 		void *entry;
 		xas_lock_irq(&xas);
-		while ((entry = xas_find_conflict(&xas)) != NULL) {
-			if (entry == expected)
-				continue;
+		entry = xas_find_conflict(&xas);
+		if (entry != expected) {
 			xas_set_err(&xas, -EEXIST);
 			goto unlock;
 		}