[v3,RESEND] mm: shmem: implement POSIX_FADV_[WILL|DONT]NEED for shmem

Message ID: 1641488717-13865-1-git-send-email-quic_charante@quicinc.com

Commit Message

Charan Teja Kalla Jan. 6, 2022, 5:05 p.m. UTC
From: Charan Teja Reddy <charante@codeaurora.org>

Currently, fadvise(2) is supported only for files that are not
associated with noop_backing_dev_info; for files that are, such as
shmem files, fadvise is a no-op. However, file_operations->fadvise()
lets filesystems provide their own fadvise implementation. Use this
support to implement some of the POSIX_FADV_XXX functionality for
shmem files.

This patch implements the POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED
advice for shmem files, which can be helpful for drivers that want to
manage the pages of shmem files created through
shmem_file_setup[_with_mnt]().  An example use case: a driver creates
a shmem file of the size it requires, maps the pages for DMA, and
passes the fd to user space. The user, who knows the usage pattern of
these pages well, can then decide when they are no longer required and
push them to swap through DONTNEED, freeing up memory well in advance
rather than relying on reclaim, and use WILLNEED when it decides they
will be useful in the near future. IOW, it lets clients free up or
read back the memory whenever they want to. Another use case: GEM
objects, which are currently allocated and managed through shmem
files, can issue vfs_fadvise(DONT|WILLNEED) on the shmem fd when the
driver learns (e.g. through hints from user space) that the objects
are not going to be used in the near future.
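
For illustration, a driver-side sketch of the intended usage (the
buffer name and flags below are illustrative only; error handling is
elided):

    struct file *filp;

    /* Size the shmem file to the driver's requirement. */
    filp = shmem_file_setup("driver-buf", size, VM_NORESERVE);
    if (IS_ERR(filp))
            return PTR_ERR(filp);

    /* ... map the pages for DMA, pass the fd to user space ... */

    /* Client decided the pages are idle: push them to swap. */
    vfs_fadvise(filp, 0, size, POSIX_FADV_DONTNEED);

    /* Client expects to use them again soon: read them back. */
    vfs_fadvise(filp, 0, size, POSIX_FADV_WILLNEED);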

Some questions asked while reviewing this patch:

Q) Can the same thing be achieved by mapping the fd to user space and
using madvise?
A) Not all drivers map their shmem fds to user space; some want to
manage them entirely within the kernel. For example, shmem memory can
be handed to other subsystems, which fill in the data and then pass it
to yet another subsystem for further processing, with no user mapping
required at all.  A simple example: memory given to the GPU subsystem
can be filled directly and then handed to the display subsystem. The
respective drivers know best when to keep that memory in RAM or in
swap, based on, say, user activity.

Q) Should we add a documentation section to the manual pages?
A) Everything the fadvise() man page[1] says is also applicable to
shmem files, so it does not seem right to document shmem files
separately.
[1] https://linux.die.net/man/2/fadvise

Q) The proposed semantics of POSIX_FADV_DONTNEED are actually similar
to MADV_PAGEOUT and different from MADV_DONTNEED. This is a
user-facing API; won't this difference cause confusion?
A) The man page[1] says that "POSIX_FADV_DONTNEED attempts to free
cached pages associated with the specified region", i.e. issuing this
advice is expected to free the file's cached pages. Whether dirty
pages are written back first is implementation-defined, and unwritten
dirty pages will not be freed. So FADV_DONTNEED also covers the
semantics of MADV_PAGEOUT for file pages, and a separate PAGEOUT
operation for file pages would serve no purpose.
[1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html
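
From user space, the same advice can be exercised on any shmem-backed
fd; a minimal sketch (a memfd is used here purely as an example of a
shmem file, and return values are unchecked):

    /* user-space sketch: swap out and read back a shmem (memfd) file */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            off_t len = 1 << 20;
            int fd = memfd_create("example", 0);    /* shmem-backed */

            ftruncate(fd, len);
            /* ... populate the file ... */
            posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED); /* to swap */
            posix_fadvise(fd, 0, len, POSIX_FADV_WILLNEED); /* back in */
            close(fd);
            return 0;
    }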

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
---

Changes in V3:
  -- Considered THP pages while doing FADVISE_[DONT|WILL]NEED, as identified by Matthew.
  -- Used the xarray properly, as identified by Matthew.
  -- Excluded mapped pages, as they require unmapping and the fadvise man pages don't talk about them.
  -- RESEND: Fixed the compilation issue when CONFIG_TMPFS is not defined.

Changes in V2:
  -- Rearranged the code so as not to sleep with the RCU lock held while using the xas_*() functionality.
  -- Addressed the comments from Suren.
  -- https://patchwork.kernel.org/project/linux-mm/patch/1638442253-1591-1-git-send-email-quic_charante@quicinc.com/

Changes in V1:
  -- Created the interface for fadvise(2) to work on shmem files.
  -- https://patchwork.kernel.org/project/linux-mm/patch/1633701982-22302-1-git-send-email-charante@codeaurora.org/

 mm/shmem.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)

Comments

Mark Hemment Jan. 7, 2022, 12:10 p.m. UTC | #1
On Thu, 6 Jan 2022 at 17:06, Charan Teja Reddy
<quic_charante@quicinc.com> wrote:
>
> From: Charan Teja Reddy <charante@codeaurora.org>
>
> Currently, fadvise(2) is supported only for files that are not
> associated with noop_backing_dev_info; for files that are, such as
> shmem files, fadvise is a no-op. However, file_operations->fadvise()
> lets filesystems provide their own fadvise implementation. Use this
> support to implement some of the POSIX_FADV_XXX functionality for
> shmem files.
>
> [snip]

> +static int shmem_fadvise_willneed(struct address_space *mapping,
> +                                pgoff_t start, pgoff_t end)
> +{
> +       XA_STATE(xas, &mapping->i_pages, start);
> +       struct page *page;
> +
> +       rcu_read_lock();
> +       xas_for_each(&xas, page, end) {
> +               if (!xa_is_value(page))
> +                       continue;
> +               xas_pause(&xas);
> +               rcu_read_unlock();
> +
> +               page = shmem_read_mapping_page(mapping, xas.xa_index);
> +               if (!IS_ERR(page))
> +                       put_page(page);
> +
> +               rcu_read_lock();
> +               if (need_resched()) {
> +                       xas_pause(&xas);
> +                       cond_resched_rcu();
> +               }
> +       }
> +       rcu_read_unlock();
> +
> +       return 0;

I have a doubt on referencing xa_index after calling xas_pause().
xas_pause() walks xa_index forward, so will not be the value expected
for the current page.
Also, not necessary to re-call xas_pause() before cond_resched (it is
a no-op).  Would be better to check need_resched() before
rcu_read_lock().

As this loop may call xas_pause() for most iterations, should consider
using xa_for_each() instead (I *think* - still getting up to speed
with XArray).
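
Something like this, perhaps (untested sketch; assumes
xa_for_each_range() is suitable for the bounded walk - the iterator
then handles the RCU locking and pausing internally):

    unsigned long index;
    struct page *page;

    xa_for_each_range(&mapping->i_pages, index, page, start, end) {
            /* only swap (value) entries need to be read back */
            if (!xa_is_value(page))
                    continue;
            page = shmem_read_mapping_page(mapping, index);
            if (!IS_ERR(page))
                    put_page(page);
            cond_resched();
    }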

Mark
Charan Teja Kalla Jan. 10, 2022, 10:21 a.m. UTC | #2
Thanks Mark for the review!!

On 1/7/2022 5:40 PM, Mark Hemment wrote:
> On Thu, 6 Jan 2022 at 17:06, Charan Teja Reddy
> <quic_charante@quicinc.com> wrote:
>>
>> From: Charan Teja Reddy <charante@codeaurora.org>
>>
>> Currently, fadvise(2) is supported only for files that are not
>> associated with noop_backing_dev_info; for files that are, such as
>> shmem files, fadvise is a no-op. However, file_operations->fadvise()
>> lets filesystems provide their own fadvise implementation. Use this
>> support to implement some of the POSIX_FADV_XXX functionality for
>> shmem files.
>>
>> [snip]
> 
>> +static int shmem_fadvise_willneed(struct address_space *mapping,
>> +                                pgoff_t start, pgoff_t end)
>> +{
>> +       XA_STATE(xas, &mapping->i_pages, start);
>> +       struct page *page;
>> +
>> +       rcu_read_lock();
>> +       xas_for_each(&xas, page, end) {
>> +               if (!xa_is_value(page))
>> +                       continue;
>> +               xas_pause(&xas);
>> +               rcu_read_unlock();
>> +
>> +               page = shmem_read_mapping_page(mapping, xas.xa_index);
>> +               if (!IS_ERR(page))
>> +                       put_page(page);
>> +
>> +               rcu_read_lock();
>> +               if (need_resched()) {
>> +                       xas_pause(&xas);
>> +                       cond_resched_rcu();
>> +               }
>> +       }
>> +       rcu_read_unlock();
>> +
>> +       return 0;
> 
> I have a doubt on referencing xa_index after calling xas_pause().
> xas_pause() walks xa_index forward, so will not be the value expected
> for the current page.

Agree here. I should have had a better test case to verify my changes.

> Also, not necessary to re-call xas_pause() before cond_resched (it is
> a no-op).

In the event that CONFIG_DEBUG_ATOMIC_SLEEP is enabled, users may still
need to call xas_pause(), as we are dropping the RCU lock. No?

static inline void cond_resched_rcu(void)
{
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
        rcu_read_unlock();
        cond_resched();
        rcu_read_lock();
#endif
}

> Would be better to check need_resched() before
> rcu_read_lock().

Okay, I can directly use cond_resched() if used before rcu_read_lock().

> 
> As this loop may call xas_pause() for most iterations, should consider
> using xa_for_each() instead (I *think* - still getting up to speed
> with XArray).

Even the xarray documentation says that: If most entries found during a
walk require you to call xas_pause(), the xa_for_each() iterator may be
more appropriate.

Since every value entry found in the xarray requires me to do the
xas_pause(), I do agree that xa_for_each() is the appropriate call here.
Will switch to this in the next spin. Waiting for further review
comments on this patch.

> 
> Mark
>
Mark Hemment Jan. 10, 2022, 12:36 p.m. UTC | #3
On Thu, 6 Jan 2022 at 17:06, Charan Teja Reddy
<quic_charante@quicinc.com> wrote:
>
> From: Charan Teja Reddy <charante@codeaurora.org>
>
> Currently, fadvise(2) is supported only for files that are not
> associated with noop_backing_dev_info; for files that are, such as
> shmem files, fadvise is a no-op. However, file_operations->fadvise()
> lets filesystems provide their own fadvise implementation. Use this
> support to implement some of the POSIX_FADV_XXX functionality for
> shmem files.
...
> +static void shmem_isolate_pages_range(struct address_space *mapping, loff_t start,
> +                               loff_t end, struct list_head *list)
> +{
> +       XA_STATE(xas, &mapping->i_pages, start);
> +       struct page *page;
> +
> +       rcu_read_lock();
> +       xas_for_each(&xas, page, end) {
> +               if (xas_retry(&xas, page))
> +                       continue;
> +               if (xa_is_value(page))
> +                       continue;
> +               if (!get_page_unless_zero(page))
> +                       continue;
> +               if (isolate_lru_page(page))
> +                       continue;

Need to unwind the get_page on failure to isolate.

Should PageUnevictable() pages (SHM_LOCK) be skipped?
(That is, does SHM_LOCK override DONTNEED?)
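
If so, perhaps something like this sketch, which also unwinds the
reference on a skip or a failed isolate:

    /* after get_page_unless_zero() succeeds: */
    if (PageUnevictable(page) || isolate_lru_page(page)) {
            put_page(page);         /* unwind the reference */
            continue;
    }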

...
> +static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
> +                               loff_t end)
> +{
> +       int ret;
> +       struct page *page;
> +       LIST_HEAD(list);
> +       struct writeback_control wbc = {
> +               .sync_mode = WB_SYNC_NONE,
> +               .nr_to_write = LONG_MAX,
> +               .range_start = 0,
> +               .range_end = LLONG_MAX,
> +               .for_reclaim = 1,
> +       };
> +
> +       if (!shmem_mapping(mapping))
> +               return -EINVAL;
> +
> +       if (!total_swap_pages)
> +               return 0;
> +
> +       lru_add_drain();
> +       shmem_isolate_pages_range(mapping, start, end, &list);
> +
> +       while (!list_empty(&list)) {
> +               page = lru_to_page(&list);
> +               list_del(&page->lru);
> +               if (page_mapped(page))
> +                       goto keep;
> +               if (!trylock_page(page))
> +                       goto keep;
> +               if (unlikely(PageTransHuge(page))) {
> +                       if (split_huge_page_to_list(page, &list))
> +                               goto keep;
> +               }

I don't know the shmem code and the lifecycle of a shm-page, so
genuine questions;
When the try-lock succeeds, should there be a test for PageWriteback()
(page skipped if true)?  Also, does page->mapping need to be tested
for NULL to prevent races with deletion from the page-cache?

...
> +
> +               clear_page_dirty_for_io(page);
> +               SetPageReclaim(page);
> +               ret = shmem_writepage(page, &wbc);
> +               if (ret || PageWriteback(page)) {
> +                       if (ret)
> +                               unlock_page(page);
> +                       goto keep;
> +               }
> +
> +               if (!PageWriteback(page))
> +                       ClearPageReclaim(page);
> +
> +               /*
> +                * shmem_writepage() places the page in the swapcache.
> +                * Delete the page from the swapcache and release the
> +                * page.
> +                */
> +               __mod_node_page_state(page_pgdat(page),
> +                               NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
> +               lock_page(page);
> +               delete_from_swap_cache(page);
> +               unlock_page(page);
> +               put_page(page);
> +               continue;
> +keep:
> +               putback_lru_page(page);
> +               __mod_node_page_state(page_pgdat(page),
> +                               NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
> +       }

The putback_lru_page() drops the last reference this code holds on
'page'.  Is it safe to use 'page' after dropping this reference?

Cheers,
Mark
Charan Teja Kalla Jan. 10, 2022, 3:14 p.m. UTC | #4
Thanks again Mark for the review comments!!

On 1/10/2022 6:06 PM, Mark Hemment wrote:
> On Thu, 6 Jan 2022 at 17:06, Charan Teja Reddy
> <quic_charante@quicinc.com> wrote:
>>
>> From: Charan Teja Reddy <charante@codeaurora.org>
>>
>> Currently, fadvise(2) is supported only for files that are not
>> associated with noop_backing_dev_info; for files that are, such as
>> shmem files, fadvise is a no-op. However, file_operations->fadvise()
>> lets filesystems provide their own fadvise implementation. Use this
>> support to implement some of the POSIX_FADV_XXX functionality for
>> shmem files.
> ...
>> +static void shmem_isolate_pages_range(struct address_space *mapping, loff_t start,
>> +                               loff_t end, struct list_head *list)
>> +{
>> +       XA_STATE(xas, &mapping->i_pages, start);
>> +       struct page *page;
>> +
>> +       rcu_read_lock();
>> +       xas_for_each(&xas, page, end) {
>> +               if (xas_retry(&xas, page))
>> +                       continue;
>> +               if (xa_is_value(page))
>> +                       continue;
>> +               if (!get_page_unless_zero(page))
>> +                       continue;
>> +               if (isolate_lru_page(page))
>> +                       continue;
> 
> Need to unwind the get_page on failure to isolate.

Will be done.

> 
> Should PageUnevictable() pages (SHM_LOCK) be skipped?
> (That is, does SHM_LOCK override DONTNEED?)


Should be skipped. Will be done.

> 
> ...
>> +static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
>> +                               loff_t end)
>> +{
>> +       int ret;
>> +       struct page *page;
>> +       LIST_HEAD(list);
>> +       struct writeback_control wbc = {
>> +               .sync_mode = WB_SYNC_NONE,
>> +               .nr_to_write = LONG_MAX,
>> +               .range_start = 0,
>> +               .range_end = LLONG_MAX,
>> +               .for_reclaim = 1,
>> +       };
>> +
>> +       if (!shmem_mapping(mapping))
>> +               return -EINVAL;
>> +
>> +       if (!total_swap_pages)
>> +               return 0;
>> +
>> +       lru_add_drain();
>> +       shmem_isolate_pages_range(mapping, start, end, &list);
>> +
>> +       while (!list_empty(&list)) {
>> +               page = lru_to_page(&list);
>> +               list_del(&page->lru);
>> +               if (page_mapped(page))
>> +                       goto keep;
>> +               if (!trylock_page(page))
>> +                       goto keep;
>> +               if (unlikely(PageTransHuge(page))) {
>> +                       if (split_huge_page_to_list(page, &list))
>> +                               goto keep;
>> +               }
> 
> I don't know the shmem code and the lifecycle of a shm-page, so
> genuine questions;
> When the try-lock succeeds, should there be a test for PageWriteback()
> (page skipped if true)?  Also, does page->mapping need to be tested
> for NULL to prevent races with deletion from the page-cache?

I failed to envisage these. I should have considered both conditions
here. BTW, I am wondering why we shouldn't use the
reclaim_pages(page_list) function here, with an extra set_page_dirty()
on each isolated page? It just calls shrink_page_list(), where all
these conditions are properly handled. What is your opinion here?
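
I.e., something roughly like the below (untested sketch; assumes the
NR_ISOLATED accounting is dropped, since the existing reclaim_pages()
caller, madvise(MADV_PAGEOUT), does not keep it):

    shmem_isolate_pages_range(mapping, start, end, &list);

    /* force-dirty so shrink_page_list() writes the pages to swap */
    list_for_each_entry(page, &list, lru)
            set_page_dirty(page);
    reclaim_pages(&list);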

> 
> ...
>> +
>> +               clear_page_dirty_for_io(page);
>> +               SetPageReclaim(page);
>> +               ret = shmem_writepage(page, &wbc);
>> +               if (ret || PageWriteback(page)) {
>> +                       if (ret)
>> +                               unlock_page(page);
>> +                       goto keep;
>> +               }
>> +
>> +               if (!PageWriteback(page))
>> +                       ClearPageReclaim(page);
>> +
>> +               /*
>> +                * shmem_writepage() places the page in the swapcache.
>> +                * Delete the page from the swapcache and release the
>> +                * page.
>> +                */
>> +               __mod_node_page_state(page_pgdat(page),
>> +                               NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
>> +               lock_page(page);
>> +               delete_from_swap_cache(page);
>> +               unlock_page(page);
>> +               put_page(page);
>> +               continue;
>> +keep:
>> +               putback_lru_page(page);
>> +               __mod_node_page_state(page_pgdat(page),
>> +                               NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
>> +       }
> 
> > The putback_lru_page() drops the last reference this code holds on
> > 'page'.  Is it safe to use 'page' after dropping this reference?

True. Will correct it in the next revision.

> 
> Cheers,
> Mark
>
Charan Teja Kalla Jan. 12, 2022, 8:21 a.m. UTC | #5
Hello Mark,

On 1/10/2022 3:51 PM, Charan Teja Kalla wrote:
>>> +static int shmem_fadvise_willneed(struct address_space *mapping,
>>> +                                pgoff_t start, pgoff_t end)
>>> +{
>>> +       XA_STATE(xas, &mapping->i_pages, start);
>>> +       struct page *page;
>>> +
>>> +       rcu_read_lock();
>>> +       xas_for_each(&xas, page, end) {
>>> +               if (!xa_is_value(page))
>>> +                       continue;
>>> +               xas_pause(&xas);
>>> +               rcu_read_unlock();
>>> +
>>> +               page = shmem_read_mapping_page(mapping, xas.xa_index);
>>> +               if (!IS_ERR(page))
>>> +                       put_page(page);
>>> +
>>> +               rcu_read_lock();
>>> +               if (need_resched()) {
>>> +                       xas_pause(&xas);
>>> +                       cond_resched_rcu();
>>> +               }
>>> +       }
>>> +       rcu_read_unlock();
>>> +
>>> +       return 0;
>> I have a doubt on referencing xa_index after calling xas_pause().
>> xas_pause() walks xa_index forward, so will not be the value expected
>> for the current page.
> Agree here. I should have had a better test case to verify my changes.
> 
>> Also, not necessary to re-call xas_pause() before cond_resched (it is
>> a no-op).
> In the event that CONFIG_DEBUG_ATOMIC_SLEEP is enabled, users may still
> need to call xas_pause(), as we are dropping the RCU lock. No?
> 
> static inline void cond_resched_rcu(void)
> {
> #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
>         rcu_read_unlock();
>         cond_resched();
>         rcu_read_lock();
> #endif
> }
> 
>> Would be better to check need_resched() before
>> rcu_read_lock().
> Okay, I can directly use cond_resched() if used before rcu_read_lock().
> 
>> As this loop may call xas_pause() for most iterations, should consider
>> using xa_for_each() instead (I *think* - still getting up to speed
>> with XArray).
> Even the xarray documentation says that: If most entries found during a
> walk require you to call xas_pause(), the xa_for_each() iterator may be
> more appropriate.
> 
> Since every value entry found in the xarray requires me to do the
> xas_pause(), I do agree that xa_for_each() is the appropriate call here.
> Will switch to this in the next spin. Waiting for further review
> comments on this patch.

I also found the below documentation:
xa_for_each() will spin if it hits a retry entry; if you intend to see
retry entries, you should use the xas_for_each() iterator instead.

Since retry entries are expected, I should be using the xas_for_each()
with the corrections you had pointed out. Isn't it?

>
Mark Hemment Jan. 12, 2022, 11:34 a.m. UTC | #6
On Wed, 12 Jan 2022 at 08:22, Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Hello Mark,
>
> On 1/10/2022 3:51 PM, Charan Teja Kalla wrote:
> >>> +static int shmem_fadvise_willneed(struct address_space *mapping,
> >>> +                                pgoff_t start, pgoff_t end)
> >>> +{
> >>> +       XA_STATE(xas, &mapping->i_pages, start);
> >>> +       struct page *page;
> >>> +
> >>> +       rcu_read_lock();
> >>> +       xas_for_each(&xas, page, end) {
> >>> +               if (!xa_is_value(page))
> >>> +                       continue;
> >>> +               xas_pause(&xas);
> >>> +               rcu_read_unlock();
> >>> +
> >>> +               page = shmem_read_mapping_page(mapping, xas.xa_index);
> >>> +               if (!IS_ERR(page))
> >>> +                       put_page(page);
> >>> +
> >>> +               rcu_read_lock();
> >>> +               if (need_resched()) {
> >>> +                       xas_pause(&xas);
> >>> +                       cond_resched_rcu();
> >>> +               }
> >>> +       }
> >>> +       rcu_read_unlock();
> >>> +
> >>> +       return 0;
> >> I have a doubt on referencing xa_index after calling xas_pause().
> >> xas_pause() walks xa_index forward, so will not be the value expected
> >> for the current page.
> > Agree here. I should have had a better test case to verify my changes.
> >
> >> Also, not necessary to re-call xas_pause() before cond_resched (it is
> >> a no-op).
> > In the event that CONFIG_DEBUG_ATOMIC_SLEEP is enabled, users may still
> > need to call xas_pause(), as we are dropping the RCU lock. No?
> >
> > static inline void cond_resched_rcu(void)
> > {
> > #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
> >         rcu_read_unlock();
> >         cond_resched();
> >         rcu_read_lock();
> > #endif
> > }
> >
> >> Would be better to check need_resched() before
> >> rcu_read_lock().
> > Okay, I can directly use cond_resched() if used before rcu_read_lock().
> >
> >> As this loop may call xas_pause() for most iterations, should consider
> >> using xa_for_each() instead (I *think* - still getting up to speed
> >> with XArray).
> > Even the xarray documentation says that: If most entries found during a
> > walk require you to call xas_pause(), the xa_for_each() iterator may be
> > more appropriate.
> >
> > Since every value entry found in the xarray requires me to do the
> > xas_pause(), I do agree that xa_for_each() is the appropriate call here.
> > Will switch to this in the next spin. Waiting for further review
> > comments on this patch.
>
> I also found the below documentation:
> xa_for_each() will spin if it hits a retry entry; if you intend to see
> retry entries, you should use the xas_for_each() iterator instead.
>
> Since retry entries are expected, I should be using the xas_for_each()
> with the corrections you had pointed out. Isn't it?
>

Ah, you've hit a barrier on my Xarray knowledge.
The current shmem code simply does a 'continue' on xas_retry().  Is
this different from Xarray looping internally for xas_retry()?  I
assume not, but cannot give a definite answer (sorry).

Cheers,
Mark
Mark Hemment Jan. 12, 2022, 11:38 a.m. UTC | #7
On Mon, 10 Jan 2022 at 15:14, Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Thanks again Mark for the review comments!!
>
> On 1/10/2022 6:06 PM, Mark Hemment wrote:
> > On Thu, 6 Jan 2022 at 17:06, Charan Teja Reddy
> > <quic_charante@quicinc.com> wrote:
> >>
> >> From: Charan Teja Reddy <charante@codeaurora.org>
> >>
> >> Currently, fadvise(2) is supported only for files that are not
> >> associated with noop_backing_dev_info; for files that are, such as
> >> shmem files, fadvise is a no-op. However, file_operations->fadvise()
> >> lets filesystems provide their own fadvise implementation. Use this
> >> support to implement some of the POSIX_FADV_XXX functionality for
> >> shmem files.
> > ...
> >> +static void shmem_isolate_pages_range(struct address_space *mapping, loff_t start,
> >> +                               loff_t end, struct list_head *list)
> >> +{
> >> +       XA_STATE(xas, &mapping->i_pages, start);
> >> +       struct page *page;
> >> +
> >> +       rcu_read_lock();
> >> +       xas_for_each(&xas, page, end) {
> >> +               if (xas_retry(&xas, page))
> >> +                       continue;
> >> +               if (xa_is_value(page))
> >> +                       continue;
> >> +               if (!get_page_unless_zero(page))
> >> +                       continue;
> >> +               if (isolate_lru_page(page))
> >> +                       continue;
> >
> > Need to unwind the get_page on failure to isolate.
>
> Will be done.
>
> >
> > Should PageUnevictable() pages (SHM_LOCK) be skipped?
> > (That is, does SHM_LOCK override DONTNEED?)
>
>
> Should be skipped. Will be done.
>
> >
> > ...
> >> +static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
> >> +                               loff_t end)
> >> +{
> >> +       int ret;
> >> +       struct page *page;
> >> +       LIST_HEAD(list);
> >> +       struct writeback_control wbc = {
> >> +               .sync_mode = WB_SYNC_NONE,
> >> +               .nr_to_write = LONG_MAX,
> >> +               .range_start = 0,
> >> +               .range_end = LLONG_MAX,
> >> +               .for_reclaim = 1,
> >> +       };
> >> +
> >> +       if (!shmem_mapping(mapping))
> >> +               return -EINVAL;
> >> +
> >> +       if (!total_swap_pages)
> >> +               return 0;
> >> +
> >> +       lru_add_drain();
> >> +       shmem_isolate_pages_range(mapping, start, end, &list);
> >> +
> >> +       while (!list_empty(&list)) {
> >> +               page = lru_to_page(&list);
> >> +               list_del(&page->lru);
> >> +               if (page_mapped(page))
> >> +                       goto keep;
> >> +               if (!trylock_page(page))
> >> +                       goto keep;
> >> +               if (unlikely(PageTransHuge(page))) {
> >> +                       if (split_huge_page_to_list(page, &list))
> >> +                               goto keep;
> >> +               }
> >
> > I don't know the shmem code and the lifecycle of a shm-page, so
> > genuine questions;
> > When the try-lock succeeds, should there be a test for PageWriteback()
> > (page skipped if true)?  Also, does page->mapping need to be tested
> > for NULL to prevent races with deletion from the page-cache?
>
> I failed to envisage these. I should have considered both conditions
> here. BTW, I am wondering why we shouldn't use the
> reclaim_pages(page_list) function here, with an extra set_page_dirty()
> on each isolated page? It just calls shrink_page_list(), where all
> these conditions are properly handled. What is your opinion here?

Should be possible to use reclaim_pages() (I haven't looked closely).
It might actually be good to use this function, as it will do some
congestion throttling.  Although it will always try to unmap
pages (note: your page_mapped() test is 'unstable', as it is done
without the page locked), so it might give behaviour you want to avoid.
Note: reclaim_pages() is already used for madvise(PAGEOUT).  The shmem
code would need to prepare page(s) to help shrink_page_list() to make
progress (see madvise.c:madvise_cold_or_pageout_pte_range()).

Taking a step back; is fadvise(DONTNEED) really needed/wanted?  Yes,
you gave a usecase (which I cut from this thread in my earlier reply),
but I'm not familiar with various shmem uses to know if this feature
is needed.  Someone else will need to answer this.

Cheers,
Mark

>
> >
> > ...
> >> +
> >> +               clear_page_dirty_for_io(page);
> >> +               SetPageReclaim(page);
> >> +               ret = shmem_writepage(page, &wbc);
> >> +               if (ret || PageWriteback(page)) {
> >> +                       if (ret)
> >> +                               unlock_page(page);
> >> +                       goto keep;
> >> +               }
> >> +
> >> +               if (!PageWriteback(page))
> >> +                       ClearPageReclaim(page);
> >> +
> >> +               /*
> >> +                * shmem_writepage() places the page in the swapcache.
> >> +                * Delete the page from the swapcache and release the
> >> +                * page.
> >> +                */
> >> +               __mod_node_page_state(page_pgdat(page),
> >> +                               NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
> >> +               lock_page(page);
> >> +               delete_from_swap_cache(page);
> >> +               unlock_page(page);
> >> +               put_page(page);
> >> +               continue;
> >> +keep:
> >> +               putback_lru_page(page);
> >> +               __mod_node_page_state(page_pgdat(page),
> >> +                               NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
> >> +       }
> >
> > The putback_lru_page() drops the last reference this code holds on
> > 'page'.  Is it safe to use 'page' after dropping this reference?
>
> True. Will correct it in the next revision.
>
> >
> > Cheers,
> > Mark
> >
Matthew Wilcox (Oracle) Jan. 12, 2022, 1:19 p.m. UTC | #8
On Wed, Jan 12, 2022 at 01:51:55PM +0530, Charan Teja Kalla wrote:
> >>> +       rcu_read_lock();
> >>> +       xas_for_each(&xas, page, end) {
> >>> +               if (!xa_is_value(page))
> >>> +                       continue;
> >>> +               xas_pause(&xas);
> >>> +               rcu_read_unlock();
> >>> +
> >>> +               page = shmem_read_mapping_page(mapping, xas.xa_index);
> >>> +               if (!IS_ERR(page))
> >>> +                       put_page(page);
> >>> +
> >>> +               rcu_read_lock();
> >>> +               if (need_resched()) {
> >>> +                       xas_pause(&xas);
> >>> +                       cond_resched_rcu();
> >>> +               }
> >>> +       }
> >>> +       rcu_read_unlock();
> > Even the xarray documentation says that: If most entries found during a
> > walk require you to call xas_pause(), the xa_for_each() iterator may be
> > more appropriate.

Yes.  This should obviously be an xa_for_each() loop.

> > Since every value entry found in the xarray requires me to do the
> > xas_pause(), I do agree that xa_for_each() is the appropriate call here.
> > Will switch to this in the next spin. Waiting for further review
> > comments on this patch.
> 
> I also found the below documentation:
> xa_for_each() will spin if it hits a retry entry; if you intend to see
> retry entries, you should use the xas_for_each() iterator instead.
> 
> Since retry entries are expected, I should be using the xas_for_each()
> with the corrections you had pointed out. Isn't it?

No.  You aren't handling retry entries at all; you clearly don't
expect to see them.  Just let the XArray code handle them itself.
Charan Teja Kalla Jan. 12, 2022, 1:35 p.m. UTC | #9
Thanks Matthew for the review.

On 1/12/2022 6:49 PM, Matthew Wilcox wrote:
> On Wed, Jan 12, 2022 at 01:51:55PM +0530, Charan Teja Kalla wrote:
>>>>> +       rcu_read_lock();
>>>>> +       xas_for_each(&xas, page, end) {
>>>>> +               if (!xa_is_value(page))
>>>>> +                       continue;
>>>>> +               xas_pause(&xas);
>>>>> +               rcu_read_unlock();
>>>>> +
>>>>> +               page = shmem_read_mapping_page(mapping, xas.xa_index);
>>>>> +               if (!IS_ERR(page))
>>>>> +                       put_page(page);
>>>>> +
>>>>> +               rcu_read_lock();
>>>>> +               if (need_resched()) {
>>>>> +                       xas_pause(&xas);
>>>>> +                       cond_resched_rcu();
>>>>> +               }
>>>>> +       }
>>>>> +       rcu_read_unlock();
>>> Even the xarray documentation says that: If most entries found during a
>>> walk require you to call xas_pause(), the xa_for_each() iterator may be
>>> more appropriate.
> 
> Yes.  This should obviously be an xa_for_each() loop.
> 

ACK.

>>> Since every value entry found in the xarray requires me to do the
>>> xas_pause(), I do agree that xa_for_each() is the appropriate call here.
>>> Will switch to this in the next spin. Waiting for further review
>>> comments on this patch.
>>
>> I also found the below documentation:
>> xa_for_each() will spin if it hits a retry entry; if you intend to see
>> retry entries, you should use the xas_for_each() iterator instead.
>>
>> Since retry entries are expected, I should be using the xas_for_each()
>> with the corrections you had pointed out. Isn't it?
> 
> No.  You aren't handling retry entries at all; you clearly don't
> expect to see them.  Just let the XArray code handle them itself.

ACK.

>
Charan Teja Kalla Jan. 12, 2022, 3:43 p.m. UTC | #10
Thanks Mark!!

On 1/12/2022 5:08 PM, Mark Hemment wrote:
>>>> +static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
>>>> +                               loff_t end)
>>>> +{
>>>> +       int ret;
>>>> +       struct page *page;
>>>> +       LIST_HEAD(list);
>>>> +       struct writeback_control wbc = {
>>>> +               .sync_mode = WB_SYNC_NONE,
>>>> +               .nr_to_write = LONG_MAX,
>>>> +               .range_start = 0,
>>>> +               .range_end = LLONG_MAX,
>>>> +               .for_reclaim = 1,
>>>> +       };
>>>> +
>>>> +       if (!shmem_mapping(mapping))
>>>> +               return -EINVAL;
>>>> +
>>>> +       if (!total_swap_pages)
>>>> +               return 0;
>>>> +
>>>> +       lru_add_drain();
>>>> +       shmem_isolate_pages_range(mapping, start, end, &list);
>>>> +
>>>> +       while (!list_empty(&list)) {
>>>> +               page = lru_to_page(&list);
>>>> +               list_del(&page->lru);
>>>> +               if (page_mapped(page))
>>>> +                       goto keep;
>>>> +               if (!trylock_page(page))
>>>> +                       goto keep;
>>>> +               if (unlikely(PageTransHuge(page))) {
>>>> +                       if (split_huge_page_to_list(page, &list))
>>>> +                               goto keep;
>>>> +               }
>>> I don't know the shmem code and the lifecycle of a shm-page, so
>>> genuine questions;
>>> When the try-lock succeeds, should there be a test for PageWriteback()
>>> (page skipped if true)?  Also, does page->mapping need to be tested
>>> for NULL to prevent races with deletion from the page-cache?
>> I failed to envisage these. I should have considered both conditions
>> here. BTW, I am wondering why we shouldn't use the
>> reclaim_pages(page_list) function here, with an extra set_page_dirty()
>> on each isolated page? It just calls shrink_page_list(), where all
>> these conditions are properly handled. What is your opinion here?
> Should be possible to use reclaim_pages() (I haven't looked closely).
> It might actually be good to use this function, as it will do some
> congestion throttling.  Although it will always try to unmap
> pages (note: your page_mapped() test is 'unstable', as it is done
> without the page locked), so it might give behaviour you want to avoid.

page_mapped() can become true between isolating the page and asking for
its reclaim through reclaim_pages(), where it can then be unmapped.
Thanks for pointing it out.

BTW, the posix_fadvise man page[1] doesn't mention any restrictions on
mapped pages. If so, am I even allowed to unmap the pages when
FADV_DONTNEED is called on the file (agreed that for mapped pages we
can rely on MADV_DONTNEED too)?

[1]https://man7.org/linux/man-pages/man2/posix_fadvise.2.html

> Note: reclaim_pages() is already used for madvise(PAGEOUT).  The shmem
> code would need to prepare page(s) to help shrink_page_list() to make
> progress (see madvise.c:madvise_cold_or_pageout_pte_range()).
> 
> Taking a step back; is fadvise(DONTNEED) really needed/wanted?  Yes,
> you gave a usecase (which I cut from this thread in my earlier reply),
> but I'm not familiar with various shmem uses to know if this feature
> is needed.  Someone else will need to answer this.

Actually, I need this for the use case mentioned. As for other use
cases, I found that GEM buffers for display/graphics already use shmem
buffers:
drivers/gpu/drm/i915/gem/i915_gem_shmem.c
Charan Teja Kalla Jan. 18, 2022, 11:35 a.m. UTC | #11
Hello Matthew,

On 1/12/2022 7:05 PM, Charan Teja Kalla wrote:
>>>>> +       rcu_read_lock();
>>>>> +       xas_for_each(&xas, page, end) {
>>>>> +               if (!xa_is_value(page))
>>>>> +                       continue;
>>>>> +               xas_pause(&xas);
>>>>> +               rcu_read_unlock();
>>>>> +
>>>>> +               page = shmem_read_mapping_page(mapping, xas.xa_index);
>>>>> +               if (!IS_ERR(page))
>>>>> +                       put_page(page);
>>>>> +
>>>>> +               rcu_read_lock();
>>>>> +               if (need_resched()) {
>>>>> +                       xas_pause(&xas);
>>>>> +                       cond_resched_rcu();
>>>>> +               }
>>>>> +       }
>>>>> +       rcu_read_unlock();
>>> Even the xarray documentation says that: If most entries found during a
>>> walk require you to call xas_pause(), the xa_for_each() iterator may be
>>> more appropriate.
> Yes.  This should obviously be an xa_for_each() loop.

In one of your patches[1], which uses an xarray iterator, most of the
entries found require a call to xas_pause(), yet it still ends up using
xas_for_each() rather than xa_for_each(). Should that code be changed
to use xa_for_each()? The documentation also says that "The
xas_for_each() iterator will expand into more inline code than
xa_for_each()."

[1]https://patchwork.kernel.org/project/linux-mm/patch/20200819184850.24779-4-willy@infradead.org/
Matthew Wilcox (Oracle) Jan. 18, 2022, 1:27 p.m. UTC | #12
On Tue, Jan 18, 2022 at 05:05:40PM +0530, Charan Teja Kalla wrote:
> Hello Matthew,
> 
> > Yes.  This should obviously be an xa_for_each() loop.
> 
> In one of your patches[1], which uses an xarray iterator, most of the
> entries found require a call to xas_pause(), yet it still ends up using
> xas_for_each() rather than xa_for_each(). Should that code be changed
> to use xa_for_each()? The documentation also says that "The
> xas_for_each() iterator will expand into more inline code than
> xa_for_each()."
> 
> [1]https://patchwork.kernel.org/project/linux-mm/patch/20200819184850.24779-4-willy@infradead.org/

How do you know the distribution of swap and non-swap entries in that
region of that xarray?
Patch

diff --git a/mm/shmem.c b/mm/shmem.c
index 23c91a8..dd3ac2e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -39,6 +39,8 @@ 
 #include <linux/frontswap.h>
 #include <linux/fs_parser.h>
 #include <linux/swapfile.h>
+#include <linux/mm_inline.h>
+#include <linux/fadvise.h>
 
 static struct vfsmount *shm_mnt;
 
@@ -2275,6 +2277,175 @@  static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+static void shmem_isolate_pages_range(struct address_space *mapping, loff_t start,
+				loff_t end, struct list_head *list)
+{
+	XA_STATE(xas, &mapping->i_pages, start);
+	struct page *page;
+
+	rcu_read_lock();
+	xas_for_each(&xas, page, end) {
+		if (xas_retry(&xas, page))
+			continue;
+		if (xa_is_value(page))
+			continue;
+		if (!get_page_unless_zero(page))
+			continue;
+		if (isolate_lru_page(page))
+			continue;
+
+		list_add(&page->lru, list);
+		__mod_node_page_state(page_pgdat(page),
+				NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
+		put_page(page);
+		if (need_resched()) {
+			xas_pause(&xas);
+			cond_resched_rcu();
+		}
+	}
+	rcu_read_unlock();
+}
+
+static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
+				loff_t end)
+{
+	int ret;
+	struct page *page;
+	LIST_HEAD(list);
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = LONG_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.for_reclaim = 1,
+	};
+
+	if (!shmem_mapping(mapping))
+		return -EINVAL;
+
+	if (!total_swap_pages)
+		return 0;
+
+	lru_add_drain();
+	shmem_isolate_pages_range(mapping, start, end, &list);
+
+	while (!list_empty(&list)) {
+		page = lru_to_page(&list);
+		list_del(&page->lru);
+		if (page_mapped(page))
+			goto keep;
+		if (!trylock_page(page))
+			goto keep;
+		if (unlikely(PageTransHuge(page))) {
+			if (split_huge_page_to_list(page, &list))
+				goto keep;
+		}
+
+		clear_page_dirty_for_io(page);
+		SetPageReclaim(page);
+		ret = shmem_writepage(page, &wbc);
+		if (ret || PageWriteback(page)) {
+			if (ret)
+				unlock_page(page);
+			goto keep;
+		}
+
+		if (!PageWriteback(page))
+			ClearPageReclaim(page);
+
+		/*
+		 * shmem_writepage() places the page in the swapcache.
+		 * Delete the page from the swapcache and release the
+		 * page.
+		 */
+		__mod_node_page_state(page_pgdat(page),
+				NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
+		lock_page(page);
+		delete_from_swap_cache(page);
+		unlock_page(page);
+		put_page(page);
+		continue;
+keep:
+		putback_lru_page(page);
+		__mod_node_page_state(page_pgdat(page),
+				NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
+	}
+
+	return 0;
+}
+
+static int shmem_fadvise_willneed(struct address_space *mapping,
+				 pgoff_t start, pgoff_t end)
+{
+	XA_STATE(xas, &mapping->i_pages, start);
+	struct page *page;
+
+	rcu_read_lock();
+	xas_for_each(&xas, page, end) {
+		if (!xa_is_value(page))
+			continue;
+		xas_pause(&xas);
+		rcu_read_unlock();
+
+		page = shmem_read_mapping_page(mapping, xas.xa_index);
+		if (!IS_ERR(page))
+			put_page(page);
+
+		rcu_read_lock();
+		if (need_resched()) {
+			xas_pause(&xas);
+			cond_resched_rcu();
+		}
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
+static int shmem_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
+{
+	loff_t endbyte;
+	pgoff_t start_index;
+	pgoff_t end_index;
+	struct address_space *mapping;
+	int ret = 0;
+
+	mapping = file->f_mapping;
+	if (!mapping || len < 0)
+		return -EINVAL;
+
+	endbyte = (u64)offset + (u64)len;
+	if (!len || endbyte < len)
+		endbyte = -1;
+	else
+		endbyte--;
+
+
+	start_index = offset >> PAGE_SHIFT;
+	end_index   = endbyte >> PAGE_SHIFT;
+	switch (advice) {
+	case POSIX_FADV_DONTNEED:
+		ret = shmem_fadvise_dontneed(mapping, start_index, end_index);
+		break;
+	case POSIX_FADV_WILLNEED:
+		ret = shmem_fadvise_willneed(mapping, start_index, end_index);
+		break;
+	case POSIX_FADV_NORMAL:
+	case POSIX_FADV_RANDOM:
+	case POSIX_FADV_SEQUENTIAL:
+	case POSIX_FADV_NOREUSE:
+		/*
+		 * No bad return value, but ignore advice. May have to
+		 * implement in future.
+		 */
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
 static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
 				     umode_t mode, dev_t dev, unsigned long flags)
 {
@@ -3826,6 +3997,7 @@  static const struct file_operations shmem_file_operations = {
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= shmem_fallocate,
 #endif
+	.fadvise	= shmem_fadvise,
 };
 
 static const struct inode_operations shmem_inode_operations = {