| Message ID | 20230525191507.160076-1-khalid.aziz@oracle.com (mailing list archive) |
|---|---|
| State | New |
| Series | [v4] mm, compaction: Skip all non-migratable pages during scan |
On Thu, May 25, 2023 at 01:15:07PM -0600, Khalid Aziz wrote:
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5a9501e0ae01..b548e05f0349 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -764,6 +764,42 @@ static bool too_many_isolated(pg_data_t *pgdat)
> 	return too_many;
> }
>
> +/*
> + * Check if this base page should be skipped from isolation because
> + * it has extra refcounts that will prevent it from being migrated.
> + * This code is inspired by similar code in migrate_vma_check_page(),
> + * can_split_folio() and folio_migrate_mapping()
> + */
> +static inline bool page_has_extra_refs(struct page *page,
> +				       struct address_space *mapping)
> +{
> +	unsigned long extra_refs;
> +	struct folio *folio;
> +
> +	/*
> +	 * Skip this check for pages in ZONE_MOVABLE or MIGRATE_CMA
> +	 * pages that can not be long term pinned
> +	 */
> +	if (is_zone_movable_page(page) || is_migrate_cma_page(page))
> +		return false;
> +
> +	folio = page_folio(page);
> +
> +	/*
> +	 * caller holds a ref already from get_page_unless_zero()
> +	 * which is accounted for in folio_expected_refs()
> +	 */
> +	extra_refs = folio_expected_refs(mapping, folio);
> +
> +	/*
> +	 * This is an admittedly racy check but good enough to determine
> +	 * if a page is pinned and can not be migrated
> +	 */
> +	if ((folio_ref_count(folio) - extra_refs) > folio_mapcount(folio))
> +		return true;
> +	return false;
> +}
> +
> /**
>  * isolate_migratepages_block() - isolate all migrate-able pages within
>  *				  a single pageblock
> @@ -992,12 +1028,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> 			goto isolate_fail;

Just out of shot, we have ...

	if (unlikely(!get_page_unless_zero(page)))

This is the perfect opportunity to use folio_get_nontail_page() instead.
You get back the folio without having to cast the pointer yourself
or call page_folio().  Now you can use a folio throughout your new
function, saving a call to compound_head().

For a followup patch, everything in this loop below this point can use
the folio ... that's quite a lot of change.

> 	/*
> -	 * Migration will fail if an anonymous page is pinned in memory,
> -	 * so avoid taking lru_lock and isolating it unnecessarily in an
> -	 * admittedly racy check.
> +	 * Migration will fail if a page has extra refcounts
> +	 * from long term pinning preventing it from migrating,
> +	 * so avoid taking lru_lock and isolating it unnecessarily.
> 	 */

Isn't "long term pinning" the wrong description of the problem?  Long term
pins suggest to me FOLL_LONGTERM.  I think this is simple short term
pins that we care about here.
On 5/25/2023 3:58 PM, Matthew Wilcox wrote:
> On Thu, May 25, 2023 at 01:15:07PM -0600, Khalid Aziz wrote:
>> [v4 patch quoted in full]
>
> Isn't "long term pinning" the wrong description of the problem?  Long term
> pins suggest to me FOLL_LONGTERM.  I think this is simple short term
> pins that we care about here.

vfio pins are held for a long time - Steve
On 5/25/23 13:58, Matthew Wilcox wrote:
> On Thu, May 25, 2023 at 01:15:07PM -0600, Khalid Aziz wrote:
>> [v4 patch quoted in full]
>
> Just out of shot, we have ...
>
> 	if (unlikely(!get_page_unless_zero(page)))
>
> This is the perfect opportunity to use folio_get_nontail_page() instead.
> You get back the folio without having to cast the pointer yourself
> or call page_folio().  Now you can use a folio throughout your new
> function, saving a call to compound_head().
>
> For a followup patch, everything in this loop below this point can use
> the folio ... that's quite a lot of change.

Can that all be in a separate patch by itself? I tried to keep all folio
functions contained inside page_has_extra_refs(). If we change part of
isolate_migratepages_block() to folio, it would make sense to change the
rest of the function at the same time.

> Isn't "long term pinning" the wrong description of the problem?  Long term
> pins suggest to me FOLL_LONGTERM.  I think this is simple short term
> pins that we care about here.

As Steve pointed out, vfio pinned pages are long term and we are
concerned about long term pinned pages since no matter how many times we
go over them, they will not migrate.

Thanks,
Khalid
On Thu, May 25, 2023 at 04:15:07PM -0400, Steven Sistare wrote:
> On 5/25/2023 3:58 PM, Matthew Wilcox wrote:
> > [v4 patch and review comments quoted in full]
> >
> > Isn't "long term pinning" the wrong description of the problem?  Long term
> > pins suggest to me FOLL_LONGTERM.  I think this is simple short term
> > pins that we care about here.
>
> vfio pins are held for a long time - Steve

So this is a third sense of "pinned pages" that is neither what
filesystems nor the mm means by pinned pages, but whatever it is that
vfio means by pinned pages?  If only "pin" weren't such a desirable
word.  Can somebody explain to me in small words what a vfio pin looks
like because I've tried reading vfio_iommu_type1_pin_pages() and I
don't recognise anything there that looks like pinning in either of
the other two senses.
On Thu, May 25, 2023 at 09:45:34PM +0100, Matthew Wilcox wrote:
> > > Isn't "long term pinning" the wrong description of the problem?  Long term
> > > pins suggest to me FOLL_LONGTERM.  I think this is simple short term
> > > pins that we care about here.
> >
> > vfio pins are held for a long time - Steve
>
> So this is a third sense of "pinned pages" that is neither what
> filesystems nor the mm means by pinned pages, but whatever it is that
> vfio means by pinned pages?  If only "pin" weren't such a desirable
> word.  Can somebody explain to me in small words what a vfio pin looks
> like because I've tried reading vfio_iommu_type1_pin_pages() and I
> don't recognise anything there that looks like pinning in either of
> the other two senses.

Oh, I think I found it!  pin_user_pages_remote() is called by
vaddr_get_pfns().  If these are the pages you're concerned about,
then the efficient way to do what you want is simply to call
folio_maybe_dma_pinned().  Far more efficient than the current mess
of total_mapcount().
On 5/25/23 15:31, Matthew Wilcox wrote:
> Oh, I think I found it!  pin_user_pages_remote() is called by
> vaddr_get_pfns().  If these are the pages you're concerned about,
> then the efficient way to do what you want is simply to call
> folio_maybe_dma_pinned().  Far more efficient than the current mess
> of total_mapcount().

vfio pinned pages triggered this change. Wouldn't checking refcounts
against mapcount provide a more generalized way of detecting
non-migratable pages?

Thanks,
Khalid
On Fri, May 26, 2023 at 09:44:34AM -0600, Khalid Aziz wrote:
> > Oh, I think I found it!  pin_user_pages_remote() is called by
> > vaddr_get_pfns().  If these are the pages you're concerned about,
> > then the efficient way to do what you want is simply to call
> > folio_maybe_dma_pinned().  Far more efficient than the current mess
> > of total_mapcount().
>
> vfio pinned pages triggered this change. Wouldn't checking refcounts
> against mapcount provide a more generalized way of detecting
> non-migratable pages?

Well, you changed the comment to say that we were concerned about
long-term pins.  If we are, then folio_maybe_dma_pinned() is how to test
for long-term pins.  If we want to skip pages which are short-term pinned,
then we need to not change the comment, and keep using mapcount/refcount
differences.
On 26.05.23 18:44, Matthew Wilcox wrote:
> On Fri, May 26, 2023 at 09:44:34AM -0600, Khalid Aziz wrote:
>> vfio pinned pages triggered this change. Wouldn't checking refcounts
>> against mapcount provide a more generalized way of detecting
>> non-migratable pages?
>
> Well, you changed the comment to say that we were concerned about
> long-term pins.  If we are, then folio_maybe_dma_pinned() is how to test
> for long-term pins.  If we want to skip pages which are short-term pinned,
> then we need to not change the comment, and keep using mapcount/refcount
> differences.

folio_maybe_dma_pinned() is all about FOLL_PIN, not FOLL_LONGTERM.

folio_maybe_dma_pinned() would skip migrating any page that has more
than 1024 references. (shared libraries?)
On Fri, May 26, 2023 at 06:46:15PM +0200, David Hildenbrand wrote:
> On 26.05.23 18:44, Matthew Wilcox wrote:
> > Well, you changed the comment to say that we were concerned about
> > long-term pins.  If we are, then folio_maybe_dma_pinned() is how to test
> > for long-term pins.  If we want to skip pages which are short-term pinned,
> > then we need to not change the comment, and keep using mapcount/refcount
> > differences.
>
> folio_maybe_dma_pinned() is all about FOLL_PIN, not FOLL_LONGTERM.

But according to our documentation, FOLL_LONGTERM implies FOLL_PIN.

Anyway, right now, the code skips any pages which are merely FOLL_GET,
so we'll skip fewer pages if we do only skip the FOLL_PIN ones,
regardless if we'd prefer to only skip the FOLL_LONGTERM ones.

> folio_maybe_dma_pinned() would skip migrating any page that has more
> than 1024 references. (shared libraries?)

True, but maybe we should be skipping any page with that many mappings,
given how disruptive it is to the rest of the system to unmap a page
from >1024 processes.
On 26.05.23 20:46, Matthew Wilcox wrote:
> On Fri, May 26, 2023 at 06:46:15PM +0200, David Hildenbrand wrote:
>> folio_maybe_dma_pinned() is all about FOLL_PIN, not FOLL_LONGTERM.
>
> But according to our documentation, FOLL_LONGTERM implies FOLL_PIN.

Yes. But folio_maybe_dma_pinned() will indicate both, long-term pins and
short-term pins. There really is no way to distinguish both,
unfortunately.

> Anyway, right now, the code skips any pages which are merely FOLL_GET,
> so we'll skip fewer pages if we do only skip the FOLL_PIN ones,
> regardless if we'd prefer to only skip the FOLL_LONGTERM ones.
>
>> folio_maybe_dma_pinned() would skip migrating any page that has more
>> than 1024 references. (shared libraries?)
>
> True, but maybe we should be skipping any page with that many mappings,
> given how disruptive it is to the rest of the system to unmap a page
> from >1024 processes.

So any user with 1024 processes can fragment physical memory? :/

Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned().
It's all suboptimal but let's not try to make it worse.
On 5/26/23 11:50, David Hildenbrand wrote:
> On 26.05.23 20:46, Matthew Wilcox wrote:
>> But according to our documentation, FOLL_LONGTERM implies FOLL_PIN.
>
> Yes. But folio_maybe_dma_pinned() will indicate both, long-term pins
> and short-term pins. There really is no way to distinguish both,
> unfortunately.

Not yet, anyway. :)

>> Anyway, right now, the code skips any pages which are merely FOLL_GET,
>> so we'll skip fewer pages if we do only skip the FOLL_PIN ones,
>> regardless if we'd prefer to only skip the FOLL_LONGTERM ones.
>>
>>> folio_maybe_dma_pinned() would skip migrating any page that has more
>>> than 1024 references. (shared libraries?)
>>
>> True, but maybe we should be skipping any page with that many mappings,
>> given how disruptive it is to the rest of the system to unmap a page
>> from >1024 processes.
>
> So any user with 1024 processes can fragment physical memory? :/
>
> Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned().

I was actually thinking that we should minimize any more cases of
fragile mapcount and refcount comparison, which then leads to
Matthew's approach here!

thanks,
On Fri, May 26, 2023 at 07:11:05PM -0700, John Hubbard wrote:
> > So any user with 1024 processes can fragment physical memory? :/
> >
> > Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned().
>
> I was actually thinking that we should minimize any more cases of
> fragile mapcount and refcount comparison, which then leads to
> Matthew's approach here!

I was wondering if we shouldn't make folio_maybe_dma_pinned() a little
more accurate.  eg:

	if (folio_test_large(folio))
		return atomic_read(&folio->_pincount) > 0;
	return (unsigned)(folio_ref_count(folio) - folio_mapcount(folio)) >=
		GUP_PIN_COUNTING_BIAS;
On 5/26/23 20:18, Matthew Wilcox wrote:
> On Fri, May 26, 2023 at 07:11:05PM -0700, John Hubbard wrote:
>> I was actually thinking that we should minimize any more cases of
>> fragile mapcount and refcount comparison, which then leads to
>> Matthew's approach here!
>
> I was wondering if we shouldn't make folio_maybe_dma_pinned() a little
> more accurate.  eg:
>
> 	if (folio_test_large(folio))
> 		return atomic_read(&folio->_pincount) > 0;
> 	return (unsigned)(folio_ref_count(folio) - folio_mapcount(folio)) >=
> 		GUP_PIN_COUNTING_BIAS;

I'm trying to figure out what might be wrong with that, but it seems
OK. We must have talked about this earlier, but I recall vaguely that
there was not a lot of concern about the case of a page being mapped
> 1024 times. Because pinned or not, it's likely to be effectively
locked into memory due to LRU effects. As mentioned here, too.

Anyway, sure.

A detail:

The unsigned cast, I'm not sure that helps or solves anything, right?
That is, other than bugs, is it possible to get refcount < mapcount?

And if it's only due to bugs, then the casting, again, isn't likely to
mitigate the fallout from whatever mess the bug caused.

thanks,
On Sun, May 28, 2023 at 04:49:52PM -0700, John Hubbard wrote:
> On 5/26/23 20:18, Matthew Wilcox wrote:
> > I was wondering if we shouldn't make folio_maybe_dma_pinned() a little
> > more accurate.  eg:
> >
> > 	if (folio_test_large(folio))
> > 		return atomic_read(&folio->_pincount) > 0;
> > 	return (unsigned)(folio_ref_count(folio) - folio_mapcount(folio)) >=
> > 		GUP_PIN_COUNTING_BIAS;
>
> I'm trying to figure out what might be wrong with that, but it seems
> OK. We must have talked about this earlier, but I recall vaguely that
> there was not a lot of concern about the case of a page being mapped
> > 1024 times. Because pinned or not, it's likely to be effectively
> locked into memory due to LRU effects. As mentioned here, too.

That was my point of view, but David convinced me that a hostile process
can effectively lock its own memory into place.

> Anyway, sure.
>
> A detail:
>
> The unsigned cast, I'm not sure that helps or solves anything, right?
> That is, other than bugs, is it possible to get refcount < mapcount?
>
> And if it's only due to bugs, then the casting, again, isn't likely to
> mitigate the fallout from whatever mess the bug caused.

I wasn't thinking too hard about the cast.  If the caller has the folio
lock, I don't think it's possible for refcount < mapcount.  This caller
has a refcount, but doesn't hold the lock, so it is possible for them to
read mapcount first, then have both mapcount and refcount decremented
and see refcount < mapcount.

I don't think it matters too much.  We don't hold the folio lock, so it
might transition from pinned to unpinned as much as a refcount might be
decremented or a mapcount incremented.  What's important is that a
hostile process can't prevent memory from being moved indefinitely.

David, have I missed something else?
Khalid Aziz <khalid.aziz@oracle.com> writes:

> Pages pinned in memory through extra refcounts can not be migrated.
> Currently as isolate_migratepages_block() scans pages for
> compaction, it skips any pinned anonymous pages. All non-migratable
> pages should be skipped and not just the anonymous pinned pages.
> This patch adds a check for extra refcounts on a page to determine
> if the page can be migrated. This was seen as a real issue on a
> customer workload where a large number of pages were pinned by vfio
> on the host and any attempts to allocate hugepages resulted in
> significant amount of cpu time spent in either direct compaction or
> in kcompactd scanning vfio pinned pages over and over again that can
> not be migrated. These are the changes in relevant stats with this
> patch for a test run of this scenario:
>
>                           Before         After
> compact_migrate_scanned   329,798,858    370,984,387
> compact_free_scanned      40,478,406     25,843,262
> compact_isolated          135,470,452    777,235
> pgmigrate_success         544,255        507,325
> pgmigrate_fail            134,616,282    47
> kcompactd CPU time        5:12.81        0:12.28
>
> Before the patch, large number of pages were isolated but most of
> them failed to migrate.
>
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> Suggested-by: Steve Sistare <steven.sistare@oracle.com>
> Cc: Khalid Aziz <khalid@kernel.org>
> ---
> v4:
>   - Use existing folio_expected_refs() function (Suggested
>     by Huang, Ying)
>   - Use folio functions
>   - Take into account contig allocations when checking for
>     long term pinning and skip pages in ZONE_MOVABLE and
>     MIGRATE_CMA type pages (Suggested by David Hildenbrand)
>   - Use folio version of total_mapcount() instead of
>     page_mapcount() (Suggested by Baolin Wang)
>
> v3:
>   - Account for extra ref added by get_page_unless_zero() earlier
>     in isolate_migratepages_block() (Suggested by Huang, Ying)
>   - Clean up computation of extra refs to be consistent
>     (Suggested by Huang, Ying)
>
> v2:
>   - Update comments in the code (Suggested by Andrew)
>   - Use PagePrivate() instead of page_has_private() (Suggested
>     by Matthew)
>   - Pass mapping to page_has_extrarefs() (Suggested by Matthew)
>   - Use page_ref_count() (Suggested by Matthew)
>   - Rename is_pinned_page() to reflect its function more
>     accurately (Suggested by Matthew)
>
>  include/linux/migrate.h | 16 +++++++++++++++
>  mm/compaction.c         | 44 +++++++++++++++++++++++++++++++++++++----
>  mm/migrate.c            | 14 --------------
>  3 files changed, 56 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 6241a1596a75..4f59e15eae99 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -141,6 +141,22 @@ const struct movable_operations *page_movable_ops(struct page *page)
> 		((unsigned long)page->mapping - PAGE_MAPPING_MOVABLE);
> }
>
> +static inline
> +int folio_expected_refs(struct address_space *mapping,
> +			struct folio *folio)

I don't think that it's necessary to make this function inline.  It
isn't called in a hot path.

> +{
> +	int refs = 1;
> +
> +	if (!mapping)
> +		return refs;
> +
> +	refs += folio_nr_pages(folio);
> +	if (folio_test_private(folio))
> +		refs++;
> +
> +	return refs;
> +}
> +
> #ifdef CONFIG_NUMA_BALANCING
> int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> 			   int node);
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5a9501e0ae01..b548e05f0349 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -764,6 +764,42 @@ static bool too_many_isolated(pg_data_t *pgdat)
> 	return too_many;
> }
>
> +/*
> + * Check if this base page should be skipped from isolation because
> + * it has extra refcounts that will prevent it from being migrated.
> + * This code is inspired by similar code in migrate_vma_check_page(),
> + * can_split_folio() and folio_migrate_mapping()
> + */
> +static inline bool page_has_extra_refs(struct page *page,
> +				       struct address_space *mapping)
> +{
> +	unsigned long extra_refs;

s/extra_refs/expected_refs/ ?

> +	struct folio *folio;
> +
> +	/*
> +	 * Skip this check for pages in ZONE_MOVABLE or MIGRATE_CMA
> +	 * pages that can not be long term pinned
> +	 */
> +	if (is_zone_movable_page(page) || is_migrate_cma_page(page))
> +		return false;

I suggest moving these 2 checks out to the place before calling the
function.  Or change the name of the function.

> +
> +	folio = page_folio(page);
> +
> +	/*
> +	 * caller holds a ref already from get_page_unless_zero()
> +	 * which is accounted for in folio_expected_refs()
> +	 */
> +	extra_refs = folio_expected_refs(mapping, folio);
> +
> +	/*
> +	 * This is an admittedly racy check but good enough to determine
> +	 * if a page is pinned and can not be migrated
> +	 */
> +	if ((folio_ref_count(folio) - extra_refs) > folio_mapcount(folio))
> +		return true;
> +	return false;
> +}
> +
> /**
>  * isolate_migratepages_block() - isolate all migrate-able pages within
>  *				  a single pageblock
> @@ -992,12 +1028,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> 			goto isolate_fail;
>
> 		/*
> -		 * Migration will fail if an anonymous page is pinned in memory,
> -		 * so avoid taking lru_lock and isolating it unnecessarily in an
> -		 * admittedly racy check.
> +		 * Migration will fail if a page has extra refcounts
> +		 * from long term pinning preventing it from migrating,
> +		 * so avoid taking lru_lock and isolating it unnecessarily.
> 		 */
> 		mapping = page_mapping(page);
> -		if (!mapping && (page_count(page) - 1) > total_mapcount(page))
> +		if (!cc->alloc_contig && page_has_extra_refs(page, mapping))
> 			goto isolate_fail_put;
>
> 		/*
> diff --git a/mm/migrate.c b/mm/migrate.c
> index db3f154446af..a2f3e5834996 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -385,20 +385,6 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
> }
> #endif
>
> -static int folio_expected_refs(struct address_space *mapping,
> -		struct folio *folio)
> -{
> -	int refs = 1;
> -	if (!mapping)
> -		return refs;
> -
> -	refs += folio_nr_pages(folio);
> -	if (folio_test_private(folio))
> -		refs++;
> -
> -	return refs;
> -}
> -
> /*
>  * Replace the page in the mapping.
>  *

Best Regards,
Huang, Ying
On 29.05.23 02:31, Matthew Wilcox wrote: > On Sun, May 28, 2023 at 04:49:52PM -0700, John Hubbard wrote: >> On 5/26/23 20:18, Matthew Wilcox wrote: >>> On Fri, May 26, 2023 at 07:11:05PM -0700, John Hubbard wrote: >>>>> So any user with 1024 processes can fragment physical memory? :/ >>>>> >>>>> Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned(). >>>> >>>> I was actually thinking that we should minimize any more cases of >>>> fragile mapcount and refcount comparison, which then leads to >>>> Matthew's approach here! >>> >>> I was wondering if we shouldn't make folio_maybe_dma_pinned() a little >>> more accurate. eg: >>> >>> if (folio_test_large(folio)) >>> return atomic_read(&folio->_pincount) > 0; >>> return (unsigned)(folio_ref_count(folio) - folio_mapcount(folio)) >= >>> GUP_PIN_COUNTING_BIAS; >> >> I'm trying to figure out what might be wrong with that, but it seems >> OK. We must have talked about this earlier, but I recall vaguely that >> there was not a lot of concern about the case of a page being mapped >>> 1024 times. Because pinned or not, it's likely to be effectively >> locked into memory due to LRU effects. As mentioned here, too. > > That was my point of view, but David convinced me that a hostile process > can effectively lock its own memory into place. > 1) My opinion on this optimization Before I start going into detail, let me first phrase my opinion so we are on the same page: "a tiny fraction of Linux installations makes heavy use of long-term pinning -- the *single* mechanism that completely *destroys* the whole purpose of memory compaction -- and the users complain about memory compaction overhead. So we are talking about optimizing for that by eventually harming *everybody else*." Ehm ... I'm all for reasonable optimization, but not at any price. 
We don't care about a handful of long-term pinned pages in the system, this is really about vfio long-term pinning a significant amount of system RAM, and we only care about shmem here. *maybe* there is an issue with page migration when we have many page mappings, but (a) it's a separate issue and to be dealt with separately, not buried into such checks (b) it's unclear how many page mappings are too many, the magic number 1024 is just a random number (c) it needs much finer control (hostile processes). 2) mapcount vs. pagecount Now, doing these mapcount vs. pagecount checks is perfectly reasonable (see mm/ksm.c) as long as we know what we're doing. For example, we have to make sure that a possible compound page cannot get split concurrently (e.g., hold a reference). It's been used forever, I consider it stable. I completely agree that we should be careful with such mapcount vs. pagecount checks, and if we can use something better, let's use something *better*. 3) page_maybe_dma_pinned() Now, why do I dislike bringing up page_maybe_dma_pinned() [IOW, why is it not better]? Besides it ignoring FOLL_GET for now, that might be fixed at some point. I think we agree that order-0 pages are the problem, because we get guaranteed false positives with many mappings (not just on speculative page pinnings). For these order-0 pages, it's perfectly reasonable to check page_maybe_dma_pinned() *as long as* we know the number of mappings is very small. I don't consider anon pages the issue here, we barely get 1024 mappings (not even with KSM), and it's much harder to exploit because you cannot simply get new mappings via mmap(), only via fork().
In fact, we could optimize easily for order-0 anon pages if we'd need to: I have a patch lying around, it just wasn't really worth it for now, because there is only a single relevant page_maybe_dma_pinned() call in vmscan that could benefit: https://github.com/davidhildenbrand/linux/commit/0575860d064694d4e2f307b2c20a880a6a7b59ab We cannot do the same for pagecache pages, so we would possibly introduce harm by carelessly checking page_maybe_dma_pinned() on pages with many mappings. 4) folio_maybe_dma_longterm_pinned() ? I wondered yesterday whether we'd want something like folio_maybe_dma_longterm_pinned() here. Essentially using what we learned about long-term pinning of fs pages: (1) ZONE_MOVABLE, MIGRATE_CMA -> "return false;" (2) If !anon, !hugetlb, !shmem -> "return false;" (3) "return folio_maybe_dma_pinned()" Yes, above would easily create false-positives for short-term pinned pages (anon/hugetlb/shmem), but would never create false-positives for any other page (shared library ...). We would use it in the following way: bool skip_folio_in_isolation() { /* * Avoid skipping pages that are short-term pinned, the pin * might go away any moment and we'll succeed to migrate. * * We get false positives for short-term pinned anon, shmem and * hugetlb pages for now, but such short-term pins are transient. */ if (!folio_maybe_dma_longterm_pinned()) return false; /* * order-0 pages with many mappings can easily be confused * for pinned pages and this could be exploited by * malicious user-space to cause fragmentation. This is only * an optimization, so if a page (especially shmem) is mapped * many times, we'll rather try migrating it instead of * accidentally skipping it all the time. */ return folio_order(folio) != 0 || total_mappings <= 32; } Someone long-term pins a shmem page with many mappings? Too bad, we don't optimize for that and still try migrating it. BUT, I am still confused if we want to check here for "any additional references", which is what mapcount vs.
refcount is, or folio_maybe_dma_longterm_pinned(). Of course, we could similarly write a variant of skip_folio_in_isolation: bool skip_folio_in_isolation() { /* * If a page is not pinned, try migrating it. Note that this * does not consider any FOLL_GET used for DMA yet. */ if (!folio_maybe_dma_pinned()) return false; /* * order-0 pages with many mappings can easily be confused * for pinned pages and this could be exploited by * malicious user-space to cause fragmentation. This is only * an optimization, so if a page is mapped * many times, we'll rather try migrating it instead of * accidentally skipping it all the time. */ return folio_order(folio) != 0 || total_mappings <= 32; } As long as FOLL_GET is still used for DMA, the mapcount vs. pagecount checks might be better ... but it depends on whether we care about short-term or long-term pinned pages here. >> Anyway, sure. >> >> A detail: >> >> The unsigned cast, I'm not sure that helps or solves anything, right? >> That is, other than bugs, is it possible to get refcount < mapcount? BUG IMHO. >> >> And if it's only due to bugs, then the casting, again, isn't likely >> to mitigate the fallout from whatever mess the bug caused. > I wasn't thinking too hard about the cast. If the caller has the folio > lock, I don't think it's possible for refcount < mapcount. This caller > has a refcount, but doesn't hold the lock, so it is possible for them > to read mapcount first, then have both mapcount and refcount decremented > and see refcount < mapcount. > > I don't think it matters too much. We don't hold the folio lock, so > it might transition from pinned to unpinned as much as a refcount might > be decremented or a mapcount incremented. What's important is that a > hostile process can't prevent memory from being moved indefinitely. > > David, have I missed something else?
What I learned from staring at the code in mm/ksm.c:write_protect_page() for too long a while ago is that: (1) Mapping a page first increments the refcount, then the mapcount (2) Unmapping a page first decrements the mapcount, then the refcount So the refcount is supposed to always be at least as large as the mapcount. Especially if you take a snapshot of both (read first the mapcount, then the refcount). A hostile process wouldn't be able to block compaction here forever, even if we accidentally would make the wrong call once when deciding whether to isolate a page. It would work on the next attempt. That's the difference to page_maybe_dma_pinned(), which can be made to consistently block compaction. [sorry for the lengthy mail]
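The ordering argument above can be modeled in a few lines of userspace C. map_page()/unmap_page() are hypothetical stand-ins that only mimic the increment/decrement order described in (1) and (2); the model asserts that the refcount never drops below the mapcount at any intermediate state, which is why a snapshot that reads the mapcount first cannot be fooled into seeing fewer references than mappings.

```c
#include <assert.h>

/* Toy model of a page's two counters; not the real struct page. */
struct page_model {
	int refcount;
	int mapcount;
};

static void check_invariant(const struct page_model *p)
{
	/* refcount >= mapcount must hold at every intermediate state */
	assert(p->refcount >= p->mapcount);
}

/* (1) Mapping first increments the refcount, then the mapcount. */
static void map_page(struct page_model *p)
{
	p->refcount++;
	check_invariant(p);
	p->mapcount++;
	check_invariant(p);
}

/* (2) Unmapping first decrements the mapcount, then the refcount. */
static void unmap_page(struct page_model *p)
{
	p->mapcount--;
	check_invariant(p);
	p->refcount--;
	check_invariant(p);
}
```

Any interleaving of these operations keeps the invariant, so a mapcount-then-refcount snapshot errs only toward "extra references", never toward "missing references".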
On 5/29/23 03:25, David Hildenbrand wrote: > On 29.05.23 02:31, Matthew Wilcox wrote: >> On Sun, May 28, 2023 at 04:49:52PM -0700, John Hubbard wrote: >>> On 5/26/23 20:18, Matthew Wilcox wrote: >>>> On Fri, May 26, 2023 at 07:11:05PM -0700, John Hubbard wrote: >>>>>> So any user with 1024 processes can fragment physical memory? :/ >>>>>> >>>>>> Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned(). >>>>> >>>>> I was actually thinking that we should minimize any more cases of >>>>> fragile mapcount and refcount comparison, which then leads to >>>>> Matthew's approach here! >>>> >>>> I was wondering if we shouldn't make folio_maybe_dma_pinned() a little >>>> more accurate. eg: >>>> >>>> if (folio_test_large(folio)) >>>> return atomic_read(&folio->_pincount) > 0; >>>> return (unsigned)(folio_ref_count(folio) - folio_mapcount(folio)) >= >>>> GUP_PIN_COUNTING_BIAS; >>> >>> I'm trying to figure out what might be wrong with that, but it seems >>> OK. We must have talked about this earlier, but I recall vaguely that >>> there was not a lot of concern about the case of a page being mapped >>>> 1024 times. Because pinned or not, it's likely to be effectively >>> locked into memory due to LRU effects. As mentioned here, too. >> >> That was my point of view, but David convinced me that a hostile process >> can effectively lock its own memory into place. >> > > 1) My opinion on this optimization > > Before I start going into detail, let me first phrase my opinion so we are on the same page: > > "a tiny fraction of Linux installations makes heavy use of long-term pinning -- the *single* mechanism that completely > *destroys* the whole purpose of memory compaction -- and the users complain about memory compaction overhead. So we are > talking about optimizing for that by eventually harming *everybody else*." > > Ehm ... I'm all for reasonable optimization, but not at any price. 
> > We don't care about a handful of long-term pinned pages in the system, this is really about vfio long-term pinning a > significant amount of system RAM, and we only care about shmem here. > > > *maybe* there is an issue with page migration when we have many page mappings, but (a) it's a separate issue and to be > dealt with separately, not buried into such checks (b) it's unclear how many page mappings are too many, the magic > number 1024 is just a random number (c) it needs much finer control (hostile processes). > > > 2) mapcount vs. pagecount > > Now, doing these mapcount vs. pagecount checks is perfectly reasonable (see mm/ksm.c) as long as we know what we're doing. > For example, we have to make sure that a possible compound page cannot get split concurrently (e.g., hold a reference). > It's been used forever, I consider it stable. > > I completely agree that we should be careful with such mapcount vs. pagecount checks, and if we can use something > better, let's use something *better*. When we have a reliable folio_maybe_dma_longterm_pinned() function, it will be better to call that instead of doing a refcount vs mapcount check. Until that better function to check for pinned pages is in place, may I propose that the current patch, though not optimal, fixes a customer problem and is a good enough working solution. When a better function is in place, page_has_extra_refs() can be updated to rely on that function instead of the refcount vs mapcount check. Thanks, Khalid > > > 3) page_maybe_dma_pinned() > > Now, why do I dislike bringing up page_maybe_dma_pinned() [IOW, why is it not better]? Besides it ignoring FOLL_GET for > now, that might be fixed at some point. > > I think we agree that order-0 pages are the problem, because we get guaranteed false positives with many mappings (not > just on speculative page pinnings).
For these order-0 pages, it's perfectly reasonable to check page_maybe_dma_pinned() > *as long as* we know the number of mappings is very small. > > I don't consider anon pages the issue here, we barely get 1024 mappings (not even with KSM), and it's much harder to > exploit because you cannot simply get new mappings via mmap(), only via fork(). > > In fact, we could optimize easily for order-0 anon pages if we'd need to: I have a patch lying around, it just wasn't > really worth it for now, because there is only a single relevant page_maybe_dma_pinned() call in vmscan that could benefit: > > https://github.com/davidhildenbrand/linux/commit/0575860d064694d4e2f307b2c20a880a6a7b59ab > > We cannot do the same for pagecache pages, so we would possibly introduce harm by carelessly checking > page_maybe_dma_pinned() on pages > with many mappings. > > > 4) folio_maybe_dma_longterm_pinned() ? > > I thought yesterday if we'd want something like folio_maybe_dma_longterm_pinned() here. Essentially using what we > learned about long-term pinning of fs pages: > > (1) ZONE_MOVABLE, MIGRATE_CMA -> "return false;" > (2) If !anon, !hugetlb, !shmem -> "return false;" > (3) "return folio_maybe_dma_pinned()" > > Yes, above would easily create false-positives for short-term pinned pages (anon/hugetlb/shmem), but would never create > false-positives for any other page (shared library ...). > > > We would use it in the following way: > > bool skip_folio_in_isolation() > { > /* > * Avoid skipping pages that are short-term pinned, the pin > * might go away any moment and we'll succeed to migrate. > * > * We get false positives for short-term pinned anon, shmem and > * hugetlb pages for now, but such short-term pins are transient. > */ > if (!folio_maybe_dma_longterm_pinned()) > return false; > /* > * order-0 pages with many mappings can easily be confused > * for pinned pages and this could be exploited by > * malicious user-space to cause fragmentation.
This is only > * an optimization, so if a page (especially shmem) is mapped > * many times, we'll rather try migrating it instead of > * accidentally skipping it all the time. > */ > return folio_order(folio) != 0 || total_mappings <= 32; > } > > Someone long-term pins a shmem page with many mappings? Too bad, we don't optimize for that and still try migrating it. > > > BUT, I am still confused if we want to check here for "any additional references", which is what mapcount vs. refcount > is, or folio_maybe_dma_longterm_pinned(). > > Of course, we could similarly write a variant of skip_folio_in_isolation: > > bool skip_folio_in_isolation() > { > /* > * If a page is not pinned, try migrating it. Note that this > * does not consider any FOLL_GET used for DMA yet. > */ > if (!folio_maybe_dma_pinned()) > return false; > /* > * order-0 pages with many mappings can easily be confused > * for pinned pages and this could be exploited by > * malicious user-space to cause fragmentation. This is only > * an optimization, so if a page is mapped > * many times, we'll rather try migrating it instead of > * accidentally skipping it all the time. > */ > return folio_order(folio) != 0 || total_mappings <= 32; > } > > > As long as FOLL_GET is still used for DMA, the mapcount vs. pagecount checks might be better ... but it depends on if we > care about short-term or long-term pinned pages here. > >>> Anyway, sure. >>> >>> A detail: >>> >>> The unsigned cast, I'm not sure that helps or solves anything, right? >>> That is, other than bugs, is it possible to get refcount < mapcount? > > BUG IMHO. > >>> >>> And if it's only due to bugs, then the casting, again, isn't likely >>> to mitigate the fallout from whatever mess the bug caused. >> >> I wasn't thinking too hard about the cast. If the caller has the folio >> lock, I don't think it's possible for refcount < mapcount.
This caller >> has a refcount, but doesn't hold the lock, so it is possible for them >> to read mapcount first, then have both mapcount and refcount decremented >> and see refcount < mapcount. >> >> I don't think it matters too much. We don't hold the folio lock, so >> it might transition from pinned to unpinned as much as a refcount might >> be decremented or a mapcount incremented. What's important is that a >> hostile process can't prevent memory from being moved indefinitely. >> >> David, have I missed something else? > > > What I learned from staring at the code in mm/ksm.c:write_protect_page() for too long a while ago is that: > > (1) Mapping a page first increments the refcount, then the mapcount > (2) Unmapping a page first decrements the mapcount, then the refcount > > So the refcount is supposed to always be at least as large as the mapcount. Especially if you take a snapshot of both (read > first the mapcount, then the refcount). > > A hostile process wouldn't be able to block compaction here forever, even if we accidentally would make the wrong call > once when deciding whether to isolate a page. It would work on the next attempt. > > That's the difference to page_maybe_dma_pinned(), which can be made to consistently block compaction. > > > [sorry for the lengthy mail] >
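The folio_maybe_dma_pinned() heuristic discussed in the quoted text, combined with the proposed small-mapcount guard, can be sketched in userspace as follows. GUP_PIN_COUNTING_BIAS matches the kernel's FOLL_PIN accounting constant; `skip_in_isolation()`, the flat struct, and the cutoff of 32 are hypothetical models taken from the sketch above, not the actual kernel API.

```c
#include <assert.h>
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024	/* as in the kernel's FOLL_PIN accounting */

/* Hypothetical flat stand-in for an order-0 folio. */
struct folio_model {
	int refcount;
	int mapcount;
};

/*
 * Models the order-0 leg of the improved folio_maybe_dma_pinned():
 * each FOLL_PIN adds GUP_PIN_COUNTING_BIAS to the refcount, so a large
 * ref-over-map gap suggests a pin.  The unsigned cast makes a racy
 * refcount < mapcount snapshot read as "maybe pinned" instead of
 * producing a negative value.
 */
static bool maybe_dma_pinned(const struct folio_model *f)
{
	return (unsigned int)(f->refcount - f->mapcount) >= GUP_PIN_COUNTING_BIAS;
}

/*
 * Models the proposed guard: only skip isolation when the pin heuristic
 * fires and the mapcount is small; heavily shared pages are still
 * offered for migration rather than risking permanent fragmentation.
 */
static bool skip_in_isolation(const struct folio_model *f)
{
	if (!maybe_dma_pinned(f))
		return false;
	return f->mapcount <= 32;
}
```

Because every mapping contributes a reference of its own, many plain mappings leave the ref-over-map gap small and never trip the bias check; only genuine pins do.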
On Tue, 30 May 2023 09:42:33 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote: > > I completely agree that we should be careful with such mapcount vs. pagecount checks, and if we can use something > > better, let's use something *better*. > > When we have a reliable folio_maybe_dma_longterm_pinned() function, it will be better to call that instead of doing > refcount vs mapcount check. Until that better function to check for pinned pages is in place, may I propose that the > current patch fixes a customer problem though not optimally and is a good enough working solution. When a better > function is in place, page_has_extra_refs() function can be updated to rely on this other function instead of refcount > vs mapcount. We seem rather stuck with this patch. I think I'll drop it while we ponder a way forward.
On 6/9/23 16:11, Andrew Morton wrote: > On Tue, 30 May 2023 09:42:33 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote: > >>> I completely agree that we should be careful with such mapcount vs. pagecount checks, and if we can use something >>> better, let's use something *better*. >> >> When we have a reliable folio_maybe_dma_longterm_pinned() function, it will be better to call that instead of doing >> refcount vs mapcount check. Until that better function to check for pinned pages is in place, may I propose that the >> current patch fixes a customer problem though not optimally and is a good enough working solution. When a better >> function is in place, page_has_extra_refs() function can be updated to rely on this other function instead of refcount >> vs mapcount. > > We seem rather stuck with this patch. I think I'll drop it while we > ponder a way forward. Hi Andrew, I would still recommend that we use this patch and fix the existing problem, unless there are objections from other people. This is not an optimal solution, but it works and is similar to what the kernel does elsewhere. When a better function to check pinned pages is available, we can start using that function instead. Thanks, Khalid
diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 6241a1596a75..4f59e15eae99 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -141,6 +141,22 @@ const struct movable_operations *page_movable_ops(struct page *page) ((unsigned long)page->mapping - PAGE_MAPPING_MOVABLE); } +static inline +int folio_expected_refs(struct address_space *mapping, + struct folio *folio) +{ + int refs = 1; + + if (!mapping) + return refs; + + refs += folio_nr_pages(folio); + if (folio_test_private(folio)) + refs++; + + return refs; +} + #ifdef CONFIG_NUMA_BALANCING int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node); diff --git a/mm/compaction.c b/mm/compaction.c index 5a9501e0ae01..b548e05f0349 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -764,6 +764,42 @@ static bool too_many_isolated(pg_data_t *pgdat) return too_many; } +/* + * Check if this base page should be skipped from isolation because + * it has extra refcounts that will prevent it from being migrated. 
+ * This code is inspired by similar code in migrate_vma_check_page(), + * can_split_folio() and folio_migrate_mapping() + */ +static inline bool page_has_extra_refs(struct page *page, + struct address_space *mapping) +{ + unsigned long extra_refs; + struct folio *folio; + + /* + * Skip this check for pages in ZONE_MOVABLE or MIGRATE_CMA + * pages that can not be long term pinned + */ + if (is_zone_movable_page(page) || is_migrate_cma_page(page)) + return false; + + folio = page_folio(page); + + /* + * caller holds a ref already from get_page_unless_zero() + * which is accounted for in folio_expected_refs() + */ + extra_refs = folio_expected_refs(mapping, folio); + + /* + * This is an admittedly racy check but good enough to determine + * if a page is pinned and can not be migrated + */ + if ((folio_ref_count(folio) - extra_refs) > folio_mapcount(folio)) + return true; + return false; +} + /** * isolate_migratepages_block() - isolate all migrate-able pages within * a single pageblock @@ -992,12 +1028,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, goto isolate_fail; /* - * Migration will fail if an anonymous page is pinned in memory, - * so avoid taking lru_lock and isolating it unnecessarily in an - * admittedly racy check. + * Migration will fail if a page has extra refcounts + * from long term pinning preventing it from migrating, + * so avoid taking lru_lock and isolating it unnecessarily. 
*/ mapping = page_mapping(page); - if (!mapping && (page_count(page) - 1) > total_mapcount(page)) + if (!cc->alloc_contig && page_has_extra_refs(page, mapping)) goto isolate_fail_put; /* diff --git a/mm/migrate.c b/mm/migrate.c index db3f154446af..a2f3e5834996 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -385,20 +385,6 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd) } #endif -static int folio_expected_refs(struct address_space *mapping, - struct folio *folio) -{ - int refs = 1; - if (!mapping) - return refs; - - refs += folio_nr_pages(folio); - if (folio_test_private(folio)) - refs++; - - return refs; -} - /* * Replace the page in the mapping. *
Pages pinned in memory through extra refcounts can not be migrated. Currently, as isolate_migratepages_block() scans pages for compaction, it skips any pinned anonymous pages. All non-migratable pages should be skipped, not just the pinned anonymous ones. This patch adds a check for extra refcounts on a page to determine whether the page can be migrated. This was seen as a real issue on a customer workload where a large number of pages were pinned by vfio on the host and any attempt to allocate hugepages resulted in a significant amount of CPU time spent either in direct compaction or in kcompactd scanning the vfio-pinned pages, which can not be migrated, over and over again. These are the changes in relevant stats with this patch for a test run of this scenario:

                           Before         After
compact_migrate_scanned  329,798,858   370,984,387
compact_free_scanned      40,478,406    25,843,262
compact_isolated         135,470,452       777,235
pgmigrate_success            544,255       507,325
pgmigrate_fail           134,616,282            47
kcompactd CPU time           5:12.81       0:12.28

Before the patch, a large number of pages were isolated but most of them failed to migrate.
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Suggested-by: Steve Sistare <steven.sistare@oracle.com> Cc: Khalid Aziz <khalid@kernel.org> --- v4: - Use existing folio_expected_refs() function (Suggested by Huang, Ying) - Use folio functions - Take into account contig allocations when checking for long term pinning and skip pages in ZONE_MOVABLE and MIGRATE_CMA type pages (Suggested by David Hildenbrand) - Use folio version of total_mapcount() instead of page_mapcount() (Suggested by Baolin Wang) v3: - Account for extra ref added by get_page_unless_zero() earlier in isolate_migratepages_block() (Suggested by Huang, Ying) - Clean up computation of extra refs to be consistent (Suggested by Huang, Ying) v2: - Update comments in the code (Suggested by Andrew) - Use PagePrivate() instead of page_has_private() (Suggested by Matthew) - Pass mapping to page_has_extrarefs() (Suggested by Matthew) - Use page_ref_count() (Suggested by Matthew) - Rename is_pinned_page() to reflect its function more accurately (Suggested by Matthew) include/linux/migrate.h | 16 +++++++++++++++ mm/compaction.c | 44 +++++++++++++++++++++++++++++++++++++---- mm/migrate.c | 14 ------------- 3 files changed, 56 insertions(+), 18 deletions(-)
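For scale, the stats in the commit message above imply a dramatic drop in the migration failure rate; a quick sketch of that arithmetic, with the values copied from the table:

```c
#include <assert.h>

/*
 * Migration failure rate implied by the commit-message stats:
 * pgmigrate_fail / (pgmigrate_fail + pgmigrate_success).
 */
static double fail_rate(long fail, long success)
{
	return (double)fail / (double)(fail + success);
}
```

Before the patch, roughly 99.6% of isolated pages failed to migrate; after it, failures are negligible.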