Message ID | 1567169011-4748-1-git-send-email-vinmenon@codeaurora.org (mailing list archive) |
---|---|
State | New, archived |
Series | mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path |
On Fri 30-08-19 18:13:31, Vinayak Menon wrote:
> The following race is observed, due to which a process faulting on a
> swap entry finds the page neither in swapcache nor swap. This causes
> zram to give a zero filled page that gets mapped to the process,
> resulting in a user space crash later.
>
> Consider parent and child processes Pa and Pb sharing the same swap
> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>
> Pa                                      Pb
>
> fault on VA                             fault on VA
> do_swap_page                            do_swap_page
> lookup_swap_cache fails                 lookup_swap_cache fails
>                                         Pb scheduled out
> swapin_readahead (deletes zram entry)
> swap_free (makes swap_count 1)
>                                         Pb scheduled in
>                                         swap_readpage (swap_count == 1)
>                                         Takes SWP_SYNCHRONOUS_IO path
>                                         zram entry absent
>                                         zram gives a zero filled page

This sounds like a zram issue, right? Why is a generic swap path changed
then?

> Fix this by reading the swap_count before lookup_swap_cache, which conforms
> with the order in which the page is added to swap cache and the swap count
> is decremented in do_swap_page. In the race case above, this will let Pb
> take the readahead path and thus pick the proper page from swapcache.
>
> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Hi Michal,

Thanks for reviewing this.

On 9/2/2019 6:51 PM, Michal Hocko wrote:
> On Fri 30-08-19 18:13:31, Vinayak Menon wrote:
>> The following race is observed, due to which a process faulting on a
>> swap entry finds the page neither in swapcache nor swap. This causes
>> zram to give a zero filled page that gets mapped to the process,
>> resulting in a user space crash later.
>>
>> Consider parent and child processes Pa and Pb sharing the same swap
>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>>
>> Pa                                      Pb
>>
>> fault on VA                             fault on VA
>> do_swap_page                            do_swap_page
>> lookup_swap_cache fails                 lookup_swap_cache fails
>>                                         Pb scheduled out
>> swapin_readahead (deletes zram entry)
>> swap_free (makes swap_count 1)
>>                                         Pb scheduled in
>>                                         swap_readpage (swap_count == 1)
>>                                         Takes SWP_SYNCHRONOUS_IO path
>>                                         zram entry absent
>>                                         zram gives a zero filled page
> This sounds like a zram issue, right? Why is a generic swap path changed
> then?

I think the zram entry being deleted by Pa, and zram giving out a zeroed
page to Pb, is normal. This is because zram avoids lazy swap slot freeing
by implementing gendisk->fops->swap_slot_free_notify, and
swap_slot_free_notify deletes the zram entry because the page is in
swapcache. The issue is that Pb attempted a swapcache lookup before Pa
brought the page to swapcache, and failed. If Pb had taken the
swapin_readahead path, __read_swap_cache_async would have performed a
second lookup and found the page in swapcache. The issue here is that,
due to the lookup failure and swap_count being 1, it takes the
SWP_SYNCHRONOUS_IO path and does a direct read which is bound to fail.
So it seems to me to be a problem in the way SWP_SYNCHRONOUS_IO is
handled in do_swap_page, and not a problem with zram. Any swap device
that sets SWP_SYNCHRONOUS_IO and implements swap_slot_free_notify can
hit this bug.
do_swap_page first brings the page into swapcache and then decrements the
swap_count, and the SWP_SYNCHRONOUS_IO code in do_swap_page performs the
swapcache and swap_count checks in the same order. Due to the thread
preemption described in the sequence above, it can happen that the
SWP_SYNCHRONOUS_IO path fails the swapcache check but sees the swap_count
decremented later, thus missing a valid swapcache entry.

I have not tested, but the following patch may also fix the issue.

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 063c0c1..a5ca05f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int __swap_count(swp_entry_t entry);
+extern bool __swap_has_cache(swp_entry_t entry);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
 	return 0;
 }
 
+static inline bool __swap_has_cache(swp_entry_t entry)
+{
+	return false;
+}
+
 static inline int __swp_swapcount(swp_entry_t entry)
 {
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index e0c232f..a13511f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		struct swap_info_struct *si = swp_swap_info(entry);
 
 		if (si->flags & SWP_SYNCHRONOUS_IO &&
-		    __swap_count(entry) == 1) {
+		    __swap_count(entry) == 1 &&
+		    !__swap_has_cache(entry)) {
 			/* skip swapcache */
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 					vmf->address);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 80445f4..2a1554a8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
 	return count;
 }
 
+bool __swap_has_cache(swp_entry_t entry)
+{
+	struct swap_info_struct *si;
+	pgoff_t offset = swp_offset(entry);
+	bool has_cache = false;
+
+	si = get_swap_device(entry);
+	if (si) {
+		has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
+		put_swap_device(si);
+	}
+	return has_cache;
+}
+
 static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 {
 	int count = 0;

>
>> Fix this by reading the swap_count before lookup_swap_cache, which conforms
>> with the order in which the page is added to swap cache and the swap count
>> is decremented in do_swap_page. In the race case above, this will let Pb
>> take the readahead path and thus pick the proper page from swapcache.
>>
>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
On Tue 03-09-19 11:43:16, Vinayak Menon wrote:
> Hi Michal,
>
> Thanks for reviewing this.
>
> On 9/2/2019 6:51 PM, Michal Hocko wrote:
> > [...]
> > This sounds like a zram issue, right? Why is a generic swap path changed
> > then?
>
> I think the zram entry being deleted by Pa, and zram giving out a zeroed
> page to Pb, is normal.

Isn't that a data loss? The race you mentioned shouldn't be possible
with the standard swap storage AFAIU. If that is really the case then
zram needs a fix rather than the generic path. Or at least a very good
explanation why the generic path is the preferred way.
On 9/3/2019 5:11 PM, Michal Hocko wrote:
> On Tue 03-09-19 11:43:16, Vinayak Menon wrote:
>> [...]
>> I think the zram entry being deleted by Pa, and zram giving out a zeroed
>> page to Pb, is normal.
> Isn't that a data loss? The race you mentioned shouldn't be possible
> with the standard swap storage AFAIU. If that is really the case then
> zram needs a fix rather than the generic path. Or at least a very good
> explanation why the generic path is the preferred way.

AFAIK, there isn't a data loss because, before deleting the entry,
swap_slot_free_notify makes sure that the page is in swapcache and marks
the page dirty to ensure a swap out before reclaim. I am referring to
the comment about this in swap_slot_free_notify. In the case of this
race too, the page brought to swapcache by Pa is still in swapcache; it
is just that Pb failed to find it due to the race.

Yes, this race will not happen for standard swap storage, but only for
those block devices that set disk->fops->swap_slot_free_notify and have
SWP_SYNCHRONOUS_IO set (which seems to be only zram).

Now, considering that zram works as expected, the fix is in the generic
path because the race is due to the bug in the SWP_SYNCHRONOUS_IO
handling in do_swap_page. And it is only the SWP_SYNCHRONOUS_IO handling
in the generic path which is modified.
On 9/3/2019 5:47 PM, Vinayak Menon wrote:
> On 9/3/2019 5:11 PM, Michal Hocko wrote:
>> [...]
>> Isn't that a data loss? The race you mentioned shouldn't be possible
>> with the standard swap storage AFAIU. If that is really the case then
>> zram needs a fix rather than the generic path. Or at least a very good
>> explanation why the generic path is the preferred way.
>
> AFAIK, there isn't a data loss because, before deleting the entry,
> swap_slot_free_notify makes sure that the page is in swapcache and marks
> the page dirty to ensure a swap out before reclaim. I am referring to
> the comment about this in swap_slot_free_notify. In the case of this
> race too, the page brought to swapcache by Pa is still in swapcache; it
> is just that Pb failed to find it due to the race.
>
> Yes, this race will not happen for standard swap storage, but only for
> those block devices that set disk->fops->swap_slot_free_notify and have
> SWP_SYNCHRONOUS_IO set (which seems to be only zram).
>
> Now, considering that zram works as expected, the fix is in the generic
> path because the race is due to the bug in the SWP_SYNCHRONOUS_IO
> handling in do_swap_page. And it is only the SWP_SYNCHRONOUS_IO handling
> in the generic path which is modified.

Hi Michal,

Do you see any concerns with the patch or the explanation of the problem?
On Mon 09-09-19 09:35:39, Vinayak Menon wrote:
> [...]
>
> Hi Michal,
>
> Do you see any concerns with the patch or the explanation of the problem?

I am sorry, I didn't have time to give this a more serious thought. You
need somebody more familiar with the code, and with time to look into it.
Hi Vinayak,

On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
> [...]
>
> Fix this by reading the swap_count before lookup_swap_cache, which conforms
> with the order in which the page is added to swap cache and the swap count
> is decremented in do_swap_page. In the race case above, this will let Pb
> take the readahead path and thus pick the proper page from swapcache.

Thanks for the report, Vinayak.

It's a zram specific issue because it deallocates the zram block
unconditionally once the read IO is done. The expectation was that the
dirty page is in the swap cache, but with SWP_SYNCHRONOUS_IO that's not
true any more, so I want to resolve the issue in zram specific code, not
the general one.

An idea in my mind is that swap_slot_free_notify should check the slot
reference counter, and if it's higher than 1, it shouldn't free the
slot. What do you think about it?
> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
> ---
>  mm/memory.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e0c232f..22643aa 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	struct page *page = NULL, *swapcache;
>  	struct mem_cgroup *memcg;
>  	swp_entry_t entry;
> +	struct swap_info_struct *si;
> +	bool skip_swapcache = false;
>  	pte_t pte;
>  	int locked;
>  	int exclusive = 0;
> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>
>
>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> +
> +	/*
> +	 * lookup_swap_cache below can fail and before the SWP_SYNCHRONOUS_IO
> +	 * check is made, another process can populate the swapcache, delete
> +	 * the swap entry and decrement the swap count. So decide on taking
> +	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
> +	 * race described, the victim process will find a swap_count > 1
> +	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
> +	 */
> +	si = swp_swap_info(entry);
> +	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
> +		skip_swapcache = true;
> +
>  	page = lookup_swap_cache(entry, vma, vmf->address);
>  	swapcache = page;
>
>  	if (!page) {
> -		struct swap_info_struct *si = swp_swap_info(entry);
> -
> -		if (si->flags & SWP_SYNCHRONOUS_IO &&
> -		    __swap_count(entry) == 1) {
> -			/* skip swapcache */
> +		if (skip_swapcache) {
>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>  					vmf->address);
>  			if (page) {
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation
Hi Minchan,

On 9/10/2019 4:56 AM, Minchan Kim wrote:
> Hi Vinayak,
>
> On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
>> [...]
> Thanks for the report, Vinayak.
>
> It's a zram specific issue because it deallocates the zram block
> unconditionally once the read IO is done. The expectation was that the
> dirty page is in the swap cache, but with SWP_SYNCHRONOUS_IO that's not
> true any more, so I want to resolve the issue in zram specific code, not
> the general one.

Thanks for the comments, Minchan.

Trying to understand your comment better. With SWP_SYNCHRONOUS_IO also,
swap_slot_free_notify will make sure that it deletes the entry only if
the page is in swapcache. Even in the current issue case, a valid entry
is present in the swapcache at the time of the issue (brought in by Pa).
It's just that Pb missed it due to the race and tried to read again from
zram.
So I'm thinking whether it is an issue with zram deleting the entry, or
with SWP_SYNCHRONOUS_IO failing to find the valid swapcache entry. There
isn't actually a case seen where a zram entry is deleted unconditionally,
with some process yet to reference the slot and the page not in swapcache.

> An idea in my mind is that swap_slot_free_notify should check the slot
> reference counter, and if it's higher than 1, it shouldn't free the
> slot. What do you think about it?

It seems fine to me, except for the fact that it will delay zram entry
deletion for shared slots, which can be significant sometimes. Also,
should we fix this path, as the issue is with SWP_SYNCHRONOUS_IO missing
a valid swapcache entry? Can a swapcache check be done like below, before
taking the SWP_SYNCHRONOUS_IO path, as an alternative?

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 063c0c1..a5ca05f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int __swap_count(swp_entry_t entry);
+extern bool __swap_has_cache(swp_entry_t entry);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
 	return 0;
 }
 
+static inline bool __swap_has_cache(swp_entry_t entry)
+{
+	return false;
+}
+
 static inline int __swp_swapcount(swp_entry_t entry)
 {
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index e0c232f..a13511f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		struct swap_info_struct *si = swp_swap_info(entry);
 
 		if (si->flags & SWP_SYNCHRONOUS_IO &&
-		    __swap_count(entry) == 1) {
+		    __swap_count(entry) == 1 &&
+		    !__swap_has_cache(entry)) {
 			/* skip swapcache */
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 					vmf->address);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 80445f4..2a1554a8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
 	return count;
 }
 
+bool __swap_has_cache(swp_entry_t entry)
+{
+	struct swap_info_struct *si;
+	pgoff_t offset = swp_offset(entry);
+	bool has_cache = false;
+
+	si = get_swap_device(entry);
+	if (si) {
+		has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
+		put_swap_device(si);
+	}
+	return has_cache;
+}
+
 static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 {
 	int count = 0;

>
>> [...]
>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
On Tue, Sep 10, 2019 at 01:52:36PM +0530, Vinayak Menon wrote:
> Hi Minchan,
>
> On 9/10/2019 4:56 AM, Minchan Kim wrote:
>> [...]
>> It's a zram specific issue because it deallocates the zram block
>> unconditionally once the read IO is done. The expectation was that the
>> dirty page is in the swap cache, but with SWP_SYNCHRONOUS_IO that's not
>> true any more, so I want to resolve the issue in zram specific code, not
>> the general one.
>
> Thanks for the comments, Minchan.
>
> Trying to understand your comment better. With SWP_SYNCHRONOUS_IO also,
> swap_slot_free_notify will make sure that it deletes the entry only if
> the page is in swapcache. Even in the current issue case, a valid entry
> is present in the swapcache at the time of the issue (brought in by Pa).
> It's just that Pb missed it due to the race and tried to read again from
> zram. So I'm thinking whether it is an issue with zram deleting the
> entry, or with SWP_SYNCHRONOUS_IO failing to find the valid swapcache
> entry. There isn't actually a case seen where a zram entry is deleted
> unconditionally, with some process yet to reference the slot and the
> page not in swapcache.
>
>> An idea in my mind is that swap_slot_free_notify should check the slot
>> reference counter, and if it's higher than 1, it shouldn't free the
>> slot. What do you think about it?
>
> It seems fine to me, except for the fact that it will delay zram entry
> deletion for shared slots, which can be significant sometimes. Also,
> should we fix this path, as the issue is with SWP_SYNCHRONOUS_IO missing

It's always a trade-off between memory vs performance, since it could
hit in the swap cache. If it's a shared page, it's likely to hit in the
cache next time, so we could get a performance benefit. Actually,
swap_slot_free_notify is a layering violation, so I wanted to replace it
with a discard hint in the long run, and thus want to go in that
direction.

> a valid swapcache entry? Can a swapcache check be done like below,
> before taking the SWP_SYNCHRONOUS_IO path, as an alternative?

With your approach, what prevents the below scenario?

A                                        B

do_swap_page
SWP_SYNCHRONOUS_IO && __swap_count == 1
                                         shrink_page_list
                                         add_to_swap
                                         swap_count = 2
..
..
                                         do_swap_page
                                         swap_read
                                         swap_slot_free_notify
                                         zram's slot will be removed
page = alloc_page_vma
swap_readpage <-- read zero

> [the __swap_has_cache diff quoted above]
>
>> [...]
>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
On 9/10/2019 11:21 PM, Minchan Kim wrote:
> On Tue, Sep 10, 2019 at 01:52:36PM +0530, Vinayak Menon wrote:
>> Hi Minchan,
>>
>> On 9/10/2019 4:56 AM, Minchan Kim wrote:
>>> Hi Vinayak,
>>>
>>> On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
>>>
>>> < snip: original patch description, quoted in full above >
>>>
>>> Thanks for the report, Vinayak.
>>>
>>> It's a zram specific issue because zram deallocates the block
>>> unconditionally once the read IO is done. The expectation was that the
>>> dirty page is in the swap cache, but with SWP_SYNCHRONOUS_IO that's not
>>> true any more, so I want to resolve the issue in zram specific code,
>>> not the general one.
>>
>> Thanks for the comments, Minchan.
>>
>> Trying to understand your comment better. With SWP_SYNCHRONOUS_IO also,
>> swap_slot_free_notify will make sure that it deletes the entry only if
>> the page is in the swapcache. Even in the current issue case, a valid
>> entry is present in the swapcache at the time of the issue (brought in
>> by Pa). It's just that Pb missed it due to the race and tried to read
>> again from zram. So I am wondering whether it is an issue with zram
>> deleting the entry, or with SWP_SYNCHRONOUS_IO failing to find the
>> valid swapcache entry. There isn't actually a case seen where the zram
>> entry is deleted unconditionally while some process has yet to
>> reference the slot and the page is not in the swapcache.
>>
>>> An idea in my mind is that swap_slot_free_notify should check the slot
>>> reference counter, and if it's higher than 1, it shouldn't free the
>>> slot. What do you think about it?
>>
>> It seems fine to me, except for the fact that it will delay zram entry
>> deletion for shared slots, which can be significant sometimes. Also,
>> should we fix this path, as the issue is with SWP_SYNCHRONOUS_IO missing
> It's always a trade-off between memory vs performance since it could hit
> in the swap cache. If it's a shared page, it's likely to hit the cache
> next time, so we could get a performance benefit.
>
> Actually, swap_slot_free_notify is a layering violation, so I wanted to
> replace it with a discard hint in the long run, and hence want to go in
> that direction.

Okay got it.

>> a valid swapcache entry ?
>>
>> Can a swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?
> With your approach, what prevents the below scenario?
>
>                 A                        B
>
>                                          do_swap_page
>                                          SWP_SYNCHRONOUS_IO && __swap_count == 1

As shrink_page_list is picking the page from the LRU and B is trying to
read from swap simultaneously, I assume someone had read the page from
swap prior to B, when its swap_count was say 2 (for it to be reclaimed by
shrink_page_list now). If so, that read itself would have deleted the
zram entry? And the read page will be in the swapcache and dirty? In that
case, with the SWAP_HAS_CACHE check in the patch, B will take the
readahead path. And shrink_page_list would attempt a pageout to zram for
the dirty page?

> shrink_page_list
>   add_to_swap
>     swap_count = 2
>
> ..
> ..
>                                          do_swap_page
>                                          swap_read
>                                          swap_slot_free_notify
>                                            zram's slot will be removed
>                                          page = alloc_page_vma
>                                          swap_readpage <-- read zero
>
>> < snip: proposed __swap_has_cache diff, quoted in full above >
>>
>>>> < snip: original patch, quoted in full below >
Hi Vinayak,

On Wed, Sep 11, 2019 at 03:37:23PM +0530, Vinayak Menon wrote:

< snip >

> >> Can a swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?
> > With your approach, what prevents the below scenario?
> >
> >                 A                        B
> >
> >                                          do_swap_page
> >                                          SWP_SYNCHRONOUS_IO && __swap_count == 1
>
> As shrink_page_list is picking the page from the LRU and B is trying to
> read from swap simultaneously, I assume someone had read the page from
> swap prior to B, when its swap_count was say 2 (for it to be reclaimed
> by shrink_page_list now)

It could happen after B saw __swap_count == 1. Think about forking a new
process. In that case, swap_count is 2 and the forked process will access
the page (it ends up freeing the zram slot, but the page would be in the
swap cache; however, process B doesn't know it).
On 9/12/2019 10:44 PM, Minchan Kim wrote:
> Hi Vinayak,
>
> On Wed, Sep 11, 2019 at 03:37:23PM +0530, Vinayak Menon wrote:
>
> < snip >
>
>>>> Can a swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?
>>> With your approach, what prevents the below scenario?
>>>
>>>                 A                        B
>>>
>>>                                          do_swap_page
>>>                                          SWP_SYNCHRONOUS_IO && __swap_count == 1
>>
>> As shrink_page_list is picking the page from the LRU and B is trying to
>> read from swap simultaneously, I assume someone had read the page from
>> swap prior to B, when its swap_count was say 2 (for it to be reclaimed
>> by shrink_page_list now)
> It could happen after B saw __swap_count == 1. Think about forking a new
> process. In that case, swap_count is 2 and the forked process will access
> the page (it ends up freeing the zram slot, but the page would be in the
> swap cache; however, process B doesn't know it).

Okay, so when B has read __swap_count == 1, it means that it has already
taken down_read on mmap_sem in the fault path. This means the fork, which
needs down_write on the parent's mmap_sem, will not be able to proceed?
Hi Vinayak,

On Fri, Sep 13, 2019 at 02:35:41PM +0530, Vinayak Menon wrote:
>
> On 9/12/2019 10:44 PM, Minchan Kim wrote:
>
> < snip >
>
> > It could happen after B saw __swap_count == 1. Think about forking a new
> > process. In that case, swap_count is 2 and the forked process will access
> > the page (it ends up freeing the zram slot, but the page would be in the
> > swap cache; however, process B doesn't know it).
>
> Okay, so when B has read __swap_count == 1, it means that it has already
> taken down_read on mmap_sem in the fault path. This means the fork, which
> needs down_write on the parent's mmap_sem, will not be able to proceed?

You are exactly right. However, I still believe the better option to solve
the issue is to check swap_count and delete the slot only if swap_count == 1
in swap_slot_free_notify, because it's a zram specific issue and that is
safer, without depending on another locking scheme.
On 9/17/2019 1:35 AM, Minchan Kim wrote:
> Hi Vinayak,
>
> On Fri, Sep 13, 2019 at 02:35:41PM +0530, Vinayak Menon wrote:
>
> < snip >
>
>> Okay, so when B has read __swap_count == 1, it means that it has already
>> taken down_read on mmap_sem in the fault path. This means the fork, which
>> needs down_write on the parent's mmap_sem, will not be able to proceed?
>>
> You are exactly right. However, I still believe the better option to solve
> the issue is to check swap_count and delete the slot only if swap_count == 1
> in swap_slot_free_notify, because it's a zram specific issue and that is
> safer, without depending on another locking scheme.

Sure. Let me know if you want me to post a patch for that.
On Tue, Sep 17, 2019 at 11:08:49AM +0530, Vinayak Menon wrote:
>
> On 9/17/2019 1:35 AM, Minchan Kim wrote:
>
> < snip >
>
> > You are exactly right. However, I still believe the better option to solve
> > the issue is to check swap_count and delete the slot only if swap_count == 1
> > in swap_slot_free_notify, because it's a zram specific issue and that is
> > safer, without depending on another locking scheme.
>
> Sure. Let me know if you want me to post a patch for that.

Please post a patch. Thanks, Vinayak!
diff --git a/mm/memory.c b/mm/memory.c
index e0c232f..22643aa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct page *page = NULL, *swapcache;
 	struct mem_cgroup *memcg;
 	swp_entry_t entry;
+	struct swap_info_struct *si;
+	bool skip_swapcache = false;
 	pte_t pte;
 	int locked;
 	int exclusive = 0;
@@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
+
+	/*
+	 * lookup_swap_cache below can fail and before the SWP_SYNCHRONOUS_IO
+	 * check is made, another process can populate the swapcache, delete
+	 * the swap entry and decrement the swap count. So decide on taking
+	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
+	 * race described, the victim process will find a swap_count > 1
+	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
+	 */
+	si = swp_swap_info(entry);
+	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
+		skip_swapcache = true;
+
 	page = lookup_swap_cache(entry, vma, vmf->address);
 	swapcache = page;
 
 	if (!page) {
-		struct swap_info_struct *si = swp_swap_info(entry);
-
-		if (si->flags & SWP_SYNCHRONOUS_IO &&
-				__swap_count(entry) == 1) {
-			/* skip swapcache */
+		if (skip_swapcache) {
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 					vmf->address);
 			if (page) {
The following race is observed, due to which a process faulting on a swap
entry finds the page neither in the swapcache nor in swap. This causes
zram to give a zero filled page that gets mapped to the process, resulting
in a user space crash later.

Consider parent and child processes Pa and Pb sharing the same swap slot
with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set. Virtual
address 'VA' of Pa and Pb points to the shared swap entry.

Pa                                       Pb

fault on VA                              fault on VA
do_swap_page                             do_swap_page
lookup_swap_cache fails                  lookup_swap_cache fails
                                         Pb scheduled out
swapin_readahead (deletes zram entry)
swap_free (makes swap_count 1)
                                         Pb scheduled in
                                         swap_readpage (swap_count == 1)
                                         Takes SWP_SYNCHRONOUS_IO path
                                         zram entry absent
                                         zram gives a zero filled page

Fix this by reading the swap_count before lookup_swap_cache, which
conforms to the order in which the page is added to the swap cache and the
swap count is decremented in do_swap_page. In the race case above, this
will let Pb take the readahead path and thus pick the proper page from the
swapcache.

Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
---
 mm/memory.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)