Message ID | 49ae72d6-f5f-5cd-e480-e2212cb7af97@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [next,1/3] truncate,shmem: Fix data loss when hole punched in folio | expand |
On Sun, Jan 02, 2022 at 05:35:50PM -0800, Hugh Dickins wrote: > shmem_swapin_page()'s swap_free() has occasionally been generating > "_swap_info_get: Unused swap offset entry" messages. Usually that's > no worse than noise; but perhaps it indicates a worse case, when we > might there be freeing swap already reused by others. > > The multi-index xas_find_conflict() loop in shmem_add_to_page_cache() > did not allow for entry found NULL when expected to be non-NULL, so did > not catch that race when the swap has already been freed. > > The loop would not actually catch a realistic conflict which the single > check does not catch, so revert it back to the single check. I think what led to the loop was concern for the xa_state if trying to find a swap entry that's smaller than the size of the folio. So yes, the loop was expected to execute twice, but I didn't consider the case where we were looking for something non-NULL and actually found NULL. So should we actually call xas_find_conflict() twice (if we're looking for something non-NULL), and check that we get @expected, followed by NULL?
On Mon, 3 Jan 2022, Matthew Wilcox wrote: > On Sun, Jan 02, 2022 at 05:35:50PM -0800, Hugh Dickins wrote: > > shmem_swapin_page()'s swap_free() has occasionally been generating > > "_swap_info_get: Unused swap offset entry" messages. Usually that's > > no worse than noise; but perhaps it indicates a worse case, when we > > might there be freeing swap already reused by others. > > > > The multi-index xas_find_conflict() loop in shmem_add_to_page_cache() > > did not allow for entry found NULL when expected to be non-NULL, so did > > not catch that race when the swap has already been freed. > > > > The loop would not actually catch a realistic conflict which the single > > check does not catch, so revert it back to the single check. > > I think what led to the loop was concern for the xa_state if trying > to find a swap entry that's smaller than the size of the folio. > So yes, the loop was expected to execute twice, but I didn't consider > the case where we were looking for something non-NULL and actually found > NULL. > > So should we actually call xas_find_conflict() twice (if we're looking > for something non-NULL), and check that we get @expected, followed by > NULL? Sorry, I've no idea. You say "twice", and that does not fit the imaginary model I had when I said "The loop would not actually catch a realistic conflict which the single check does not catch". I was imagining it either looking at a single entry, or looking at an array of (perhaps sometimes in shmem's case 512) entries, looking for conflict with the supplied pointer/value expected there. The loop technique was already unable to report on unexpected NULLs, and the single test would catch a non-NULL entry different from an expected non-NULL entry. Its only relative weakness appeared to be if that array contained (perhaps some NULLs then) a "narrow" instance of the same pointer/value that was expected to fill the array; and I didn't see any possibility for shmem to be inserting small and large folios sharing the same address at the same time. That "explanation" may make no sense to you, don't worry about it; just as "twice" makes no immediate sense to me - I'd have to go off and study multi-index XArray to make sense of it, which I'm not about to do. I've seen no problems with the proposed patch, but if you see a real case that it's failing to cover, yes, please do improve it of course. Though now I'm wondering if the "loop" totally misled me; and your "twice" just means that we need to test first this and then that and we're done - yeah, maybe. Hugh
On Mon, Jan 03, 2022 at 12:10:21PM -0800, Hugh Dickins wrote: > On Mon, 3 Jan 2022, Matthew Wilcox wrote: > > On Sun, Jan 02, 2022 at 05:35:50PM -0800, Hugh Dickins wrote: > > > shmem_swapin_page()'s swap_free() has occasionally been generating > > > "_swap_info_get: Unused swap offset entry" messages. Usually that's > > > no worse than noise; but perhaps it indicates a worse case, when we > > > might there be freeing swap already reused by others. > > > > > > The multi-index xas_find_conflict() loop in shmem_add_to_page_cache() > > > did not allow for entry found NULL when expected to be non-NULL, so did > > > not catch that race when the swap has already been freed. > > > > > > The loop would not actually catch a realistic conflict which the single > > > check does not catch, so revert it back to the single check. > > > > I think what led to the loop was concern for the xa_state if trying > > to find a swap entry that's smaller than the size of the folio. > > So yes, the loop was expected to execute twice, but I didn't consider > > the case where we were looking for something non-NULL and actually found > > NULL. > > > > So should we actually call xas_find_conflict() twice (if we're looking > > for something non-NULL), and check that we get @expected, followed by > > NULL? > > Sorry, I've no idea. > > You say "twice", and that does not fit the imaginary model I had when I > said "The loop would not actually catch a realistic conflict which the > single check does not catch". > > I was imagining it either looking at a single entry, or looking at an > array of (perhaps sometimes in shmem's case 512) entries, looking for > conflict with the supplied pointer/value expected there. > > The loop technique was already unable to report on unexpected NULLs, > and the single test would catch a non-NULL entry different from an > expected non-NULL entry. Its only relative weakness appeared to be > if that array contained (perhaps some NULLs then) a "narrow" instance > of the same pointer/value that was expected to fill the array; and I > didn't see any possibility for shmem to be inserting small and large > folios sharing the same address at the same time. > > That "explanation" may make no sense to you, don't worry about it; > just as "twice" makes no immediate sense to me - I'd have to go off > and study multi-index XArray to make sense of it, which I'm not > about to do. > > I've seen no problems with the proposed patch, but if you see a real > case that it's failing to cover, yes, please do improve it of course. > > Though now I'm wondering if the "loop" totally misled me; and your > "twice" just means that we need to test first this and then that and > we're done - yeah, maybe. Sorry; I wrote the above in a hurry and early in the morning, so probably even less coherent than usual. Also, the multi-index xarray code is new to everyone (and adds new things to consider), so it's no surprise we're talking past each other. It's a bit strange for me to read your explanations because you're only reading what I actually wrote instead of what I intended to write. So let me try again. My concern was that we might be trying to store a 2MB entry which had a non-NULL 'expected' entry which was found in a 4k (ie single-index) slot within the 512 entries (as the first non-NULL entry in that range), and we'd then store the 2MB entry into a single-entry slot. Now, maybe that can't happen for higher-level reasons, and I don't need to worry about it. But I feel like we should check for that? Anyway, I think the right fix is this: +++ b/mm/shmem.c @@ -733,11 +733,12 @@ static int shmem_add_to_page_cache(struct page *page, cgroup_throttle_swaprate(page, gfp); do { - void *entry; xas_lock_irq(&xas); - while ((entry = xas_find_conflict(&xas)) != NULL) { - if (entry == expected) - continue; + if (expected != xas_find_conflict(&xas)) { + xas_set_err(&xas, -EEXIST); + goto unlock; + } + if (expected && xas_find_conflict(&xas)) { xas_set_err(&xas, -EEXIST); goto unlock; } which says what I mean. I certainly didn't intend to imply that I was expecting to see 512 consecutive entries which were all identical, which would be the idiomatic way to read the code that was there before. I shouldn't've tried to be so concise. (If you'd rather I write any of this differently, I'm more than happy to change it)
On Tue, 4 Jan 2022, Matthew Wilcox wrote: > On Mon, Jan 03, 2022 at 12:10:21PM -0800, Hugh Dickins wrote: > > On Mon, 3 Jan 2022, Matthew Wilcox wrote: > > > On Sun, Jan 02, 2022 at 05:35:50PM -0800, Hugh Dickins wrote: > > > > shmem_swapin_page()'s swap_free() has occasionally been generating > > > > "_swap_info_get: Unused swap offset entry" messages. Usually that's > > > > no worse than noise; but perhaps it indicates a worse case, when we > > > > might there be freeing swap already reused by others. > > > > > > > > The multi-index xas_find_conflict() loop in shmem_add_to_page_cache() > > > > did not allow for entry found NULL when expected to be non-NULL, so did > > > > not catch that race when the swap has already been freed. > > > > > > > > The loop would not actually catch a realistic conflict which the single > > > > check does not catch, so revert it back to the single check. > > > > > > I think what led to the loop was concern for the xa_state if trying > > > to find a swap entry that's smaller than the size of the folio. > > > So yes, the loop was expected to execute twice, but I didn't consider > > > the case where we were looking for something non-NULL and actually found > > > NULL. > > > > > > So should we actually call xas_find_conflict() twice (if we're looking > > > for something non-NULL), and check that we get @expected, followed by > > > NULL? > > > > Sorry, I've no idea. > > > > You say "twice", and that does not fit the imaginary model I had when I > > said "The loop would not actually catch a realistic conflict which the > > single check does not catch". > > > > I was imagining it either looking at a single entry, or looking at an > > array of (perhaps sometimes in shmem's case 512) entries, looking for > > conflict with the supplied pointer/value expected there. > > > > The loop technique was already unable to report on unexpected NULLs, > > and the single test would catch a non-NULL entry different from an > > expected non-NULL entry. Its only relative weakness appeared to be > > if that array contained (perhaps some NULLs then) a "narrow" instance > > of the same pointer/value that was expected to fill the array; and I > > didn't see any possibility for shmem to be inserting small and large > > folios sharing the same address at the same time. > > > > That "explanation" may make no sense to you, don't worry about it; > > just as "twice" makes no immediate sense to me - I'd have to go off > > and study multi-index XArray to make sense of it, which I'm not > > about to do. > > > > I've seen no problems with the proposed patch, but if you see a real > > case that it's failing to cover, yes, please do improve it of course. > > > > Though now I'm wondering if the "loop" totally misled me; and your > > "twice" just means that we need to test first this and then that and > > we're done - yeah, maybe. > > Sorry; I wrote the above in a hurry and early in the morning, so probably > even less coherent than usual. Also, the multi-index xarray code is > new to everyone (and adds new things to consider), so it's no surprise > we're talking past each other. It's a bit strange for me to read your > explanations because you're only reading what I actually wrote instead > of what I intended to write. > > So let me try again. My concern was that we might be trying to store > a 2MB entry which had a non-NULL 'expected' entry which was found in a > 4k (ie single-index) slot within the 512 entries (as the first non-NULL > entry in that range), and we'd then store the 2MB entry into a > single-entry slot. Thanks, that sounds much more like how I was imagining it. And the two separate tests much more understandable than twice round a loop. > > Now, maybe that can't happen for higher-level reasons, and I don't need > to worry about it. But I feel like we should check for that? Anyway, > I think the right fix is this: I don't object to (cheaply) excluding a possibility at the low level, even if there happen to be high level reasons why it cannot happen at present. But I don't think your second xas_find_conflict() gives quite as much assurance as you're expecting of it. Since xas_find_conflict() skips NULLs, the conflict test would pass if there were 511 NULLs and one 4k entry matching the expected entry, but a 2MB entry to be inserted (the "small and large folios" case in my earlier ramblings). I think xas_find_conflict() is not really up to giving that assurance; and maybe better than a second call to xas_find_conflict(), might be a VM_BUG_ON earlier, to say that if 'expected' is non-NULL, then the range is PAGE_SIZE - or something more folio-friendly like that? That would give you the extra assurance you're looking for, wouldn't it? For now and the known future, shmem only swaps PAGE_SIZE, out and in; maybe someone will want to change that one day, then xas_find_conflict() could be enhanced to know more of what's expected. > > +++ b/mm/shmem.c > @@ -733,11 +733,12 @@ static int shmem_add_to_page_cache(struct page *page, > cgroup_throttle_swaprate(page, gfp); > > do { > - void *entry; > xas_lock_irq(&xas); > - while ((entry = xas_find_conflict(&xas)) != NULL) { > - if (entry == expected) > - continue; > + if (expected != xas_find_conflict(&xas)) { > + xas_set_err(&xas, -EEXIST); > + goto unlock; > + } > + if (expected && xas_find_conflict(&xas)) { > xas_set_err(&xas, -EEXIST); > goto unlock; > } That also worried me because, if the second xas_find_conflict() is to make any sense, the first must have had a side-effect on xas: are those side-effects okay for the subsequent xas_store(&xas, page)? You'll know that they are, but it's not obvious to the reader. > > which says what I mean. I certainly didn't intend to imply that I > was expecting to see 512 consecutive entries which were all identical, > which would be the idiomatic way to read the code that was there before. > I shouldn't've tried to be so concise. > > (If you'd rather I write any of this differently, I'm more than happy > to change it) No, I'm happy with the style of it, just discontented that the second xas_find_conflict() pretends to more than it provides (I think). Hugh
On Mon, Jan 03, 2022 at 08:43:13PM -0800, Hugh Dickins wrote: > On Tue, 4 Jan 2022, Matthew Wilcox wrote: > > So let me try again. My concern was that we might be trying to store > > a 2MB entry which had a non-NULL 'expected' entry which was found in a > > 4k (ie single-index) slot within the 512 entries (as the first non-NULL > > entry in that range), and we'd then store the 2MB entry into a > > single-entry slot. > > Thanks, that sounds much more like how I was imagining it. And the > two separate tests much more understandable than twice round a loop. > > > Now, maybe that can't happen for higher-level reasons, and I don't need > > to worry about it. But I feel like we should check for that? Anyway, > > I think the right fix is this: > > I don't object to (cheaply) excluding a possibility at the low level, > even if there happen to be high level reasons why it cannot happen at > present. > > But I don't think your second xas_find_conflict() gives quite as much > assurance as you're expecting of it. Since xas_find_conflict() skips > NULLs, the conflict test would pass if there were 511 NULLs and one > 4k entry matching the expected entry, but a 2MB entry to be inserted > (the "small and large folios" case in my earlier ramblings). > > I think xas_find_conflict() is not really up to giving that assurance; > and maybe better than a second call to xas_find_conflict(), might be > a VM_BUG_ON earlier, to say that if 'expected' is non-NULL, then the > range is PAGE_SIZE - or something more folio-friendly like that? > That would give you the extra assurance you're looking for, > wouldn't it? I actually don't need that assurance. The wretch who wrote the documentation for xas_find_conflict() needs to be put on a diet of bread and water, but the promise that it should make is that once it returns NULL, the xa_state is restored to where it was before the first call to xas_find_conflict(). So if you're trying to store a 2MB entry and the only swap entry in the 2MB range is a 4KB entry, the xa_state gets walked back up to point at the original 512-aligned entry and the subsequent xas_store() will free the node containing the swap entry. > For now and the known future, shmem only swaps PAGE_SIZE, out and in; > maybe someone will want to change that one day, then xas_find_conflict() > could be enhanced to know more of what's expected. Good to know. > > > > +++ b/mm/shmem.c > > @@ -733,11 +733,12 @@ static int shmem_add_to_page_cache(struct page *page, > > cgroup_throttle_swaprate(page, gfp); > > > > do { > > - void *entry; > > xas_lock_irq(&xas); > > - while ((entry = xas_find_conflict(&xas)) != NULL) { > > - if (entry == expected) > > - continue; > > + if (expected != xas_find_conflict(&xas)) { > > + xas_set_err(&xas, -EEXIST); > > + goto unlock; > > + } > > + if (expected && xas_find_conflict(&xas)) { > > xas_set_err(&xas, -EEXIST); > > goto unlock; > > } > > That also worried me because, if the second xas_find_conflict() > is to make any sense, the first must have had a side-effect on xas: > are those side-effects okay for the subsequent xas_store(&xas, page)? > You'll know that they are, but it's not obvious to the reader. > > > > > which says what I mean. I certainly didn't intend to imply that I > > was expecting to see 512 consecutive entries which were all identical, > > which would be the idiomatic way to read the code that was there before. > > I shouldn't've tried to be so concise. > > > > (If you'd rather I write any of this differently, I'm more than happy > > to change it) > > No, I'm happy with the style of it, just discontented that the second > xas_find_conflict() pretends to more than it provides (I think). > > Hugh
--- hughd2/mm/shmem.c +++ hughd3/mm/shmem.c @@ -727,9 +727,8 @@ static int shmem_add_to_page_cache(struc do { void *entry; xas_lock_irq(&xas); - while ((entry = xas_find_conflict(&xas)) != NULL) { - if (entry == expected) - continue; + entry = xas_find_conflict(&xas); + if (entry != expected) { xas_set_err(&xas, -EEXIST); goto unlock; }
shmem_swapin_page()'s swap_free() has occasionally been generating "_swap_info_get: Unused swap offset entry" messages. Usually that's no worse than noise; but perhaps it indicates a worse case, when we might there be freeing swap already reused by others. The multi-index xas_find_conflict() loop in shmem_add_to_page_cache() did not allow for entry found NULL when expected to be non-NULL, so did not catch that race when the swap has already been freed. The loop would not actually catch a realistic conflict which the single check does not catch, so revert it back to the single check. Fixes: 3103f9a51dd0 ("mm: Use multi-index entries in the page cache") Signed-off-by: Hugh Dickins <hughd@google.com> --- mm/shmem.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)