Message ID | 20190531064313.193437-2-minchan@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | introduce memory hinting API for external process |
On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to kernel that the pages can be reclaimed when memory pressure
> happens but data should be preserved for future use. This could reduce
> workingset eviction so it ends up increasing performance.
>
> This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help kernel in deciding which
> pages to evict early during memory pressure.
>
> Internally, it works via deactivating pages from active list to inactive's
> head if the page is private because inactive list could be full of
> used-once pages which are first candidate for the reclaiming and that's a
> reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> if the memory pressure happens, they will be reclaimed earlier than other
> active pages unless there is no access until the time.

[I am intentionally not looking at the implementation because the points below should be clear from the changelog - sorry about nagging ;)]

What kind of pages can be deactivated? Anonymous/file backed? Private/shared? If shared, are there any restrictions?

Are there any restrictions on mappings? E.g. what would be the effect of this operation on a hugetlbfs mapping?

Also you are talking about the inactive LRU, but what kind of LRU is that? Is it the anonymous LRU? If yes, don't we have the same problem as with the early MADV_FREE implementation, where enough page cache means that deactivated anonymous memory doesn't get reclaimed anytime soon? Or worse, never, when there is no swap available?
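For readers following the thread, here is a minimal userspace sketch of how the hint described in the changelog might be issued. It is an illustration, not part of the patch: `page_span()` and `hint_cold()` are hypothetical helper names, and the fallback value 20 for MADV_COLD is an assumption taken from the mainline uapi headers of later kernels, so it may not match the patch under review. Rounding the range outward to page boundaries is a design choice of the sketch and can mark neighbouring bytes in the same pages as cold.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: value used by later mainline kernels; the patch may differ. */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/* Round [addr, addr + len) outward to page boundaries, because madvise(2)
 * requires a page-aligned start address. */
static void page_span(uintptr_t addr, size_t len, size_t page_size,
                      uintptr_t *start, size_t *span)
{
    uintptr_t mask;
    uintptr_t end;

    /* page_size must be a non-zero power of two */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    mask = (uintptr_t)page_size - 1;
    end = (addr + len + mask) & ~mask;
    *start = addr & ~mask;
    *span = (size_t)(end - *start);
}

/* Hypothetical helper: tell the kernel that [addr, addr + len) is cold.
 * Returns 0 on success or a negative errno, e.g. -EINVAL on kernels that
 * do not know MADV_COLD. The hint is non-destructive either way. */
static int hint_cold(void *addr, size_t len)
{
    uintptr_t start;
    size_t span;

    page_span((uintptr_t)addr, len, (size_t)sysconf(_SC_PAGESIZE),
              &start, &span);
    if (madvise((void *)start, span, MADV_COLD) != 0)
        return -errno;
    return 0;
}
```

A caller would typically treat a non-zero return as "hint not taken" and carry on, since the contents of the range are preserved in both cases.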
On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range, it could
> > give a hint to kernel that the pages can be reclaimed when memory pressure
> > happens but data should be preserved for future use. This could reduce
> > workingset eviction so it ends up increasing performance.
> >
> > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > MADV_COLD can be used by a process to mark a memory range as not expected
> > to be used in the near future. The hint can help kernel in deciding which
> > pages to evict early during memory pressure.
> >
> > Internally, it works via deactivating pages from active list to inactive's
> > head if the page is private because inactive list could be full of
> > used-once pages which are first candidate for the reclaiming and that's a
> > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > if the memory pressure happens, they will be reclaimed earlier than other
> > active pages unless there is no access until the time.
>
> [I am intentionally not looking at the implementation because below
> points should be clear from the changelog - sorry about nagging ;)]
>
> What kind of pages can be deactivated? Anonymous/File backed.
> Private/shared? If shared, are there any restrictions?

Both file and private pages could be deactivated from each active LRU to each inactive LRU if the page has one map_count. In other words,

	if (page_mapcount(page) <= 1)
		deactivate_page(page);

> Are there any restrictions on mappings? E.g. what would be an effect of
> this operation on hugetlbfs mapping?

VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED

> Also you are talking about inactive LRU but what kind of LRU is that? Is
> it the anonymous LRU? If yes, don't we have the same problem as with the

active file page -> inactive file LRU
active anon page -> inactive anon LRU

> early MADV_FREE implementation when enough page cache causes that
> deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> worse never when there is no swap available?

I think MADV_COLD has slightly different semantics from MADV_FREE. MADV_FREE means it's okay to discard the page under memory pressure because the content of the page is *garbage*. Furthermore, freeing such pages has almost zero overhead since we don't need to swap them out, and a later access only causes a minor fault. Thus, it makes sense to put those freeable pages on the inactive file LRU to compete with other used-once pages.

However, MADV_COLD doesn't mean the page is garbage, and freeing it requires swap-out and a later swap-in. So it would be better to move it to the inactive anon LRU list, not the file LRU. Furthermore, that avoids unnecessary scanning of those cold anonymous pages if the system doesn't have a swap device.
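The rules stated in this reply (skip certain VMAs, skip pages with more than one mapping, move active file pages to the inactive file LRU and active anon pages to the inactive anon LRU) can be sketched as a small userspace model. This is only a toy that restates the email: every `toy_` name, the flag values, and the struct layout are invented for illustration and do not correspond to the kernel's actual data structures or to the patch's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel's VM_* flags; the values are arbitrary. */
enum {
    TOY_VM_LOCKED  = 1 << 0,
    TOY_VM_HUGETLB = 1 << 1,
    TOY_VM_PFNMAP  = 1 << 2
};

enum toy_lru {
    TOY_INACTIVE_ANON,
    TOY_ACTIVE_ANON,
    TOY_INACTIVE_FILE,
    TOY_ACTIVE_FILE
};

struct toy_page {
    enum toy_lru lru;  /* which list the page currently sits on */
    int mapcount;      /* number of page table mappings */
};

/* VMAs with any of these flags are skipped, as for MADV_FREE|DONTNEED. */
static bool toy_vma_can_be_cold(unsigned int vm_flags)
{
    return !(vm_flags & (TOY_VM_LOCKED | TOY_VM_HUGETLB | TOY_VM_PFNMAP));
}

/* Apply the hint to one page; returns true if the page changed lists. */
static bool toy_madvise_cold_page(struct toy_page *page, unsigned int vm_flags)
{
    assert(page->mapcount >= 0);

    if (!toy_vma_can_be_cold(vm_flags))
        return false;
    if (page->mapcount > 1)  /* mirrors the page_mapcount(page) <= 1 check */
        return false;

    switch (page->lru) {
    case TOY_ACTIVE_ANON:
        page->lru = TOY_INACTIVE_ANON;  /* anon stays on the anon LRU */
        return true;
    case TOY_ACTIVE_FILE:
        page->lru = TOY_INACTIVE_FILE;  /* file stays on the file LRU */
        return true;
    default:
        return false;                   /* already inactive */
    }
}
```

The point of the model is the last switch: unlike the MADV_FREE behaviour discussed in the reply, a cold anonymous page never crosses over to the file LRU.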
On Fri 31-05-19 22:39:04, Minchan Kim wrote: > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > When a process expects no accesses to a certain memory range, it could > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > happens but data should be preserved for future use. This could reduce > > > workingset eviction so it ends up increasing performance. > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > to be used in the near future. The hint can help kernel in deciding which > > > pages to evict early during memory pressure. > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > head if the page is private because inactive list could be full of > > > used-once pages which are first candidate for the reclaiming and that's a > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > if the memory pressure happens, they will be reclaimed earlier than other > > > active pages unless there is no access until the time. > > > > [I am intentionally not looking at the implementation because below > > points should be clear from the changelog - sorry about nagging ;)] > > > > What kind of pages can be deactivated? Anonymous/File backed. > > Private/shared? If shared, are there any restrictions? > > Both file and private pages could be deactived from each active LRU > to each inactive LRU if the page has one map_count. In other words, > > if (page_mapcount(page) <= 1) > deactivate_page(page); Why do we restrict to pages that are single mapped? > > Are there any restrictions on mappings? E.g. what would be an effect of > > this operation on hugetlbfs mapping? 
>
> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED

OK, documenting that this is restricted to the same vmas as MADV_FREE|DONTNEED would be really useful.

> > Also you are talking about inactive LRU but what kind of LRU is that? Is
> > it the anonymous LRU? If yes, don't we have the same problem as with the
>
> active file page -> inactive file LRU
> active anon page -> inactive anon LRU
>
> > early MADV_FREE implementation when enough page cache causes that
> > deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> > worse never when there is no swap available?
>
> I think MADV_COLD is a little bit different semantic with MADV_FREE.
> MADV_FREE means it's okay to discard when the memory pressure because
> the content of the page is *garbage*. Furthermore, freeing such pages is
> almost zero overhead since we don't need to swap out and access
> afterward causes minor fault. Thus, it would make sense to put those
> freeable pages in inactive file LRU to compete other used-once pages.
>
> However, MADV_COLD doesn't means it's a garbage and freeing requires
> swap out/swap in afterward. So, it would be better to move inactive
> anon's LRU list, not file LRU. Furthermore, it would avoid unnecessary
> scanning of those cold anonymous if system doesn't have a swap device.

Please document this if it is really the desired semantic, because then you have the same set of problems as we had with the early MADV_FREE implementation mentioned above.
On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote: > On Fri 31-05-19 22:39:04, Minchan Kim wrote: > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > > When a process expects no accesses to a certain memory range, it could > > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > > happens but data should be preserved for future use. This could reduce > > > > workingset eviction so it ends up increasing performance. > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > > to be used in the near future. The hint can help kernel in deciding which > > > > pages to evict early during memory pressure. > > > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > > head if the page is private because inactive list could be full of > > > > used-once pages which are first candidate for the reclaiming and that's a > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > > if the memory pressure happens, they will be reclaimed earlier than other > > > > active pages unless there is no access until the time. > > > > > > [I am intentionally not looking at the implementation because below > > > points should be clear from the changelog - sorry about nagging ;)] > > > > > > What kind of pages can be deactivated? Anonymous/File backed. > > > Private/shared? If shared, are there any restrictions? > > > > Both file and private pages could be deactived from each active LRU > > to each inactive LRU if the page has one map_count. In other words, > > > > if (page_mapcount(page) <= 1) > > deactivate_page(page); > > Why do we restrict to pages that are single mapped? Because page table in one of process shared the page would have access bit so finally we couldn't reclaim the page. 
The more processes share the page, the more likely reclaim is to fail.

> > > Are there any restrictions on mappings? E.g. what would be an effect of
> > > this operation on hugetlbfs mapping?
> >
> > VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED
>
> OK documenting that this is restricted to the same vmas as MADV_FREE|DONTNEED
> is really useful to mention.

Sure.

> > > Also you are talking about inactive LRU but what kind of LRU is that? Is
> > > it the anonymous LRU? If yes, don't we have the same problem as with the
> >
> > active file page -> inactive file LRU
> > active anon page -> inactive anon LRU
> >
> > > early MADV_FREE implementation when enough page cache causes that
> > > deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> > > worse never when there is no swap available?
> >
> > I think MADV_COLD is a little bit different semantic with MADV_FREE.
> > MADV_FREE means it's okay to discard when the memory pressure because
> > the content of the page is *garbage*. Furthermore, freeing such pages is
> > almost zero overhead since we don't need to swap out and access
> > afterward causes minor fault. Thus, it would make sense to put those
> > freeable pages in inactive file LRU to compete other used-once pages.
> >
> > However, MADV_COLD doesn't means it's a garbage and freeing requires
> > swap out/swap in afterward. So, it would be better to move inactive
> > anon's LRU list, not file LRU. Furthermore, it would avoid unnecessary
> > scanning of those cold anonymous if system doesn't have a swap device.
>
> Please document this, if this is really a desirable semantic because
> then you have the same set of problems as we've had with the early
> MADV_FREE implementation mentioned above.

IIRC, the problem with MADV_FREE was that we couldn't discard freeable pages because the VM never scans the anonymous LRU on a swapless system. However, that is not our case because we should reclaim these pages, not discard them.

I will include it in the description.

Thanks.
On Fri 31-05-19 23:34:07, Minchan Kim wrote: > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote: > > On Fri 31-05-19 22:39:04, Minchan Kim wrote: > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > > > When a process expects no accesses to a certain memory range, it could > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > > > happens but data should be preserved for future use. This could reduce > > > > > workingset eviction so it ends up increasing performance. > > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > > > to be used in the near future. The hint can help kernel in deciding which > > > > > pages to evict early during memory pressure. > > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > > > head if the page is private because inactive list could be full of > > > > > used-once pages which are first candidate for the reclaiming and that's a > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > > > if the memory pressure happens, they will be reclaimed earlier than other > > > > > active pages unless there is no access until the time. > > > > > > > > [I am intentionally not looking at the implementation because below > > > > points should be clear from the changelog - sorry about nagging ;)] > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed. > > > > Private/shared? If shared, are there any restrictions? > > > > > > Both file and private pages could be deactived from each active LRU > > > to each inactive LRU if the page has one map_count. In other words, > > > > > > if (page_mapcount(page) <= 1) > > > deactivate_page(page); > > > > Why do we restrict to pages that are single mapped? 
>
> Because page table in one of process shared the page would have access bit
> so finally we couldn't reclaim the page. The more process it is shared,
> the more fail to reclaim.

So what? In other words, why should it be restricted solely based on the map count? I can see a reason to restrict based on the access permissions, because we do not want to simplify all sorts of side channel attacks, but memory reclaim is capable of reclaiming shared pages and so far I haven't heard any sound argument why madvise should skip those. Again, if there are any reasons, then document them in the changelog.

[...]

> > Please document this, if this is really a desirable semantic because
> > then you have the same set of problems as we've had with the early
> > MADV_FREE implementation mentioned above.
>
> IIRC, the problem of MADV_FREE was that we couldn't discard freeable
> pages because VM never scan anonymous LRU with swapless system.
> However, it's not the our case because we should reclaim them, not
> discarding.

Right. But there is still the page cache reclaim. Is it expected that explicitly cold memory doesn't get reclaimed because we have a sufficient amount of page cache (a very common case) and we never age anonymous memory because of that?
On Mon, Jun 3, 2019 at 12:16 AM Michal Hocko <mhocko@kernel.org> wrote: > On Fri 31-05-19 23:34:07, Minchan Kim wrote: > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote: > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote: > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > > > > When a process expects no accesses to a certain memory range, it could > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > > > > happens but data should be preserved for future use. This could reduce > > > > > > workingset eviction so it ends up increasing performance. > > > > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > > > > to be used in the near future. The hint can help kernel in deciding which > > > > > > pages to evict early during memory pressure. > > > > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > > > > head if the page is private because inactive list could be full of > > > > > > used-once pages which are first candidate for the reclaiming and that's a > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > > > > if the memory pressure happens, they will be reclaimed earlier than other > > > > > > active pages unless there is no access until the time. > > > > > > > > > > [I am intentionally not looking at the implementation because below > > > > > points should be clear from the changelog - sorry about nagging ;)] > > > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed. > > > > > Private/shared? If shared, are there any restrictions? > > > > > > > > Both file and private pages could be deactived from each active LRU > > > > to each inactive LRU if the page has one map_count. 
In other words,
> > > >
> > > > 	if (page_mapcount(page) <= 1)
> > > > 		deactivate_page(page);
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because page table in one of process shared the page would have access bit
> > so finally we couldn't reclaim the page. The more process it is shared,
> > the more fail to reclaim.
>
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.

Whether to reclaim shared pages is a policy decision best left to userland, IMHO.
On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote: > On Fri 31-05-19 23:34:07, Minchan Kim wrote: > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote: > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote: > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > > > > When a process expects no accesses to a certain memory range, it could > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > > > > happens but data should be preserved for future use. This could reduce > > > > > > workingset eviction so it ends up increasing performance. > > > > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > > > > to be used in the near future. The hint can help kernel in deciding which > > > > > > pages to evict early during memory pressure. > > > > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > > > > head if the page is private because inactive list could be full of > > > > > > used-once pages which are first candidate for the reclaiming and that's a > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > > > > if the memory pressure happens, they will be reclaimed earlier than other > > > > > > active pages unless there is no access until the time. > > > > > > > > > > [I am intentionally not looking at the implementation because below > > > > > points should be clear from the changelog - sorry about nagging ;)] > > > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed. > > > > > Private/shared? If shared, are there any restrictions? > > > > > > > > Both file and private pages could be deactived from each active LRU > > > > to each inactive LRU if the page has one map_count. 
In other words,
> > > >
> > > > 	if (page_mapcount(page) <= 1)
> > > > 		deactivate_page(page);
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because page table in one of process shared the page would have access bit
> > so finally we couldn't reclaim the page. The more process it is shared,
> > the more fail to reclaim.
>
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.

I think it makes sense. It could be explained, but it also follows established madvise semantics, and I'm not sure it's necessarily Minchan's job to reiterate those.

Sharing isn't exactly transparent to userspace. The kernel does COW, ksm etc. When you madvise, you can really only speak for your own reference to that memory - "*I* am not using this."

This is in line with other madvise calls: MADV_DONTNEED clears the local page table entries and drops the corresponding references, so shared pages won't get freed. MADV_FREE clears the pte dirty bit and also has explicit mapcount checks before clearing PG_dirty, so again shared pages don't get freed.
On Mon 03-06-19 13:27:17, Johannes Weiner wrote: > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote: > > On Fri 31-05-19 23:34:07, Minchan Kim wrote: > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote: > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote: > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > > > > > When a process expects no accesses to a certain memory range, it could > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > > > > > happens but data should be preserved for future use. This could reduce > > > > > > > workingset eviction so it ends up increasing performance. > > > > > > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > > > > > to be used in the near future. The hint can help kernel in deciding which > > > > > > > pages to evict early during memory pressure. > > > > > > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > > > > > head if the page is private because inactive list could be full of > > > > > > > used-once pages which are first candidate for the reclaiming and that's a > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other > > > > > > > active pages unless there is no access until the time. > > > > > > > > > > > > [I am intentionally not looking at the implementation because below > > > > > > points should be clear from the changelog - sorry about nagging ;)] > > > > > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed. > > > > > > Private/shared? If shared, are there any restrictions? 
> > > > > > > > > > Both file and private pages could be deactived from each active LRU > > > > > to each inactive LRU if the page has one map_count. In other words, > > > > > > > > > > if (page_mapcount(page) <= 1) > > > > > deactivate_page(page); > > > > > > > > Why do we restrict to pages that are single mapped? > > > > > > Because page table in one of process shared the page would have access bit > > > so finally we couldn't reclaim the page. The more process it is shared, > > > the more fail to reclaim. > > > > So what? In other words why should it be restricted solely based on the > > map count. I can see a reason to restrict based on the access > > permissions because we do not want to simplify all sorts of side channel > > attacks but memory reclaim is capable of reclaiming shared pages and so > > far I haven't heard any sound argument why madvise should skip those. > > Again if there are any reasons, then document them in the changelog. > > I think it makes sense. It could be explained, but it also follows > established madvise semantics, and I'm not sure it's necessarily > Minchan's job to re-iterate those. > > Sharing isn't exactly transparent to userspace. The kernel does COW, > ksm etc. When you madvise, you can really only speak for your own > reference to that memory - "*I* am not using this." > > This is in line with other madvise calls: MADV_DONTNEED clears the > local page table entries and drops the corresponding references, so > shared pages won't get freed. MADV_FREE clears the pte dirty bit and > also has explicit mapcount checks before clearing PG_dirty, so again > shared pages don't get freed. Right, being consistent with other madvise syscalls is certainly a way to go. And I am not pushing one way or another, I just want this to be documented with a reasoning behind. Consistency is certainly an argument to use. 
On the other hand, these non-destructive madvise operations are quite different, and the shared policy might differ as a result as well. We are aging objects rather than destroying them, after all. Being able to age pagecache with sufficient privileges sounds like a useful use case to me. In other words, you are able to cause the same effect indirectly without the madvise operation, so it kind of makes sense to allow it in a more sophisticated way.

That being said, madvise is just a _hint_ and the kernel will always be free to ignore it, so the implementation might change in the future. We can start simple and consistent with the existing MADV_$FOO operations now and extend later on. But let's document the intention in the changelog and make the decision clear. I am sorry to be so anal about this, but I have seen so many ad-hoc policies that were undocumented, and it was so hard to guess at them and make some sense of them when revisiting later on.
On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote: > On Mon 03-06-19 13:27:17, Johannes Weiner wrote: > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote: > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote: > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote: > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote: > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > > > > > > When a process expects no accesses to a certain memory range, it could > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > > > > > > happens but data should be preserved for future use. This could reduce > > > > > > > > workingset eviction so it ends up increasing performance. > > > > > > > > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > > > > > > to be used in the near future. The hint can help kernel in deciding which > > > > > > > > pages to evict early during memory pressure. > > > > > > > > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > > > > > > head if the page is private because inactive list could be full of > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other > > > > > > > > active pages unless there is no access until the time. > > > > > > > > > > > > > > [I am intentionally not looking at the implementation because below > > > > > > > points should be clear from the changelog - sorry about nagging ;)] > > > > > > > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed. > > > > > > > Private/shared? 
If shared, are there any restrictions? > > > > > > > > > > > > Both file and private pages could be deactived from each active LRU > > > > > > to each inactive LRU if the page has one map_count. In other words, > > > > > > > > > > > > if (page_mapcount(page) <= 1) > > > > > > deactivate_page(page); > > > > > > > > > > Why do we restrict to pages that are single mapped? > > > > > > > > Because page table in one of process shared the page would have access bit > > > > so finally we couldn't reclaim the page. The more process it is shared, > > > > the more fail to reclaim. > > > > > > So what? In other words why should it be restricted solely based on the > > > map count. I can see a reason to restrict based on the access > > > permissions because we do not want to simplify all sorts of side channel > > > attacks but memory reclaim is capable of reclaiming shared pages and so > > > far I haven't heard any sound argument why madvise should skip those. > > > Again if there are any reasons, then document them in the changelog. > > > > I think it makes sense. It could be explained, but it also follows > > established madvise semantics, and I'm not sure it's necessarily > > Minchan's job to re-iterate those. > > > > Sharing isn't exactly transparent to userspace. The kernel does COW, > > ksm etc. When you madvise, you can really only speak for your own > > reference to that memory - "*I* am not using this." > > > > This is in line with other madvise calls: MADV_DONTNEED clears the > > local page table entries and drops the corresponding references, so > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and > > also has explicit mapcount checks before clearing PG_dirty, so again > > shared pages don't get freed. > > Right, being consistent with other madvise syscalls is certainly a way > to go. And I am not pushing one way or another, I just want this to be > documented with a reasoning behind. Consistency is certainly an argument > to use. 
>
> On the other hand these non-destructive madvise operations are quite
> different and the shared policy might differ as a result as well. We are
> aging objects rather than destroying them after all. Being able to age
> a pagecache with a sufficient privileges sounds like a useful usecase to
> me. In other words you are able to cause the same effect indirectly
> without the madvise operation so it kinda makes sense to allow it in a
> more sophisticated way.

Right, I don't think it's about permission - as you say, you can do this indirectly. Page reclaim is all about relative page order, so if we thwarted you from demoting some pages, you could instead promote other pages to cause a similar end result.

I think it's about intent. You're advising the kernel that *you're* not using this memory and would like to have it cleared out based on that knowledge. You could do the same by simply allocating the new pages and have the kernel sort it out. However, if the kernel sorts it out, it *will* look at other users of the page, and it might decide that other pages are actually colder when considering all users.

When you ignore shared state, on the other hand, the pages you advise out could refault right after. And then, not only did you not free up the memory, but you also caused IO that may interfere with bringing in the new data for which you tried to create room in the first place.

So I don't think it ever makes sense to override it.

But it might be better to drop the explicit mapcount check and instead make the local pte young and call shrink_page_list() without the TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code to handle references and shared pages exactly the same way it would if those pages came fresh off the LRU tail, excluding only the reference from the mapping that we're madvising.
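To make the difference between the two policies in this message concrete, here is a deliberately simplified userspace model. It is a sketch under stated assumptions, not kernel code: the `toy_` names are invented, a single `young` flag per mapping stands in for the pte accessed bit, and real reclaim (shrink_page_list() and its reference checks) weighs more than this. Policy A is the explicit mapcount check from the patch; policy B is the suggestion above, where only references from mappings other than the caller's keep the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One entry per process mapping of the page (toy model). */
struct toy_mapping {
    bool young;  /* stand-in for the pte accessed bit */
};

/* Policy A: the patch's check, act only on pages mapped at most once. */
static bool toy_cold_by_mapcount(size_t nr_mappings)
{
    return nr_mappings <= 1;
}

/* Policy B: ignore the caller's own reference and let every other
 * mapping vote; the page is a reclaim candidate only if no other
 * mapping has referenced it. */
static bool toy_cold_by_references(const struct toy_mapping *maps,
                                   size_t nr_mappings, size_t caller)
{
    size_t i;

    assert(caller < nr_mappings);

    for (i = 0; i < nr_mappings; i++) {
        if (i == caller)
            continue;      /* "*I* am not using this" */
        if (maps[i].young)
            return false;  /* someone else still uses the page */
    }
    return true;
}
```

In this model a page shared with an idle peer is skipped by policy A but accepted by policy B, while a page shared with an active peer is left alone by both, which is the refault case the message warns about.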
Hi Johannes, On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote: > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote: > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote: > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote: > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote: > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote: > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote: > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > > > > > > > When a process expects no accesses to a certain memory range, it could > > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > > > > > > > happens but data should be preserved for future use. This could reduce > > > > > > > > > workingset eviction so it ends up increasing performance. > > > > > > > > > > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > > > > > > > to be used in the near future. The hint can help kernel in deciding which > > > > > > > > > pages to evict early during memory pressure. > > > > > > > > > > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > > > > > > > head if the page is private because inactive list could be full of > > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a > > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other > > > > > > > > > active pages unless there is no access until the time. 
> > > > > > > > > > > > > > > > [I am intentionally not looking at the implementation because below > > > > > > > > points should be clear from the changelog - sorry about nagging ;)] > > > > > > > > > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed. > > > > > > > > Private/shared? If shared, are there any restrictions? > > > > > > > > > > > > > > Both file and private pages could be deactived from each active LRU > > > > > > > to each inactive LRU if the page has one map_count. In other words, > > > > > > > > > > > > > > if (page_mapcount(page) <= 1) > > > > > > > deactivate_page(page); > > > > > > > > > > > > Why do we restrict to pages that are single mapped? > > > > > > > > > > Because page table in one of process shared the page would have access bit > > > > > so finally we couldn't reclaim the page. The more process it is shared, > > > > > the more fail to reclaim. > > > > > > > > So what? In other words why should it be restricted solely based on the > > > > map count. I can see a reason to restrict based on the access > > > > permissions because we do not want to simplify all sorts of side channel > > > > attacks but memory reclaim is capable of reclaiming shared pages and so > > > > far I haven't heard any sound argument why madvise should skip those. > > > > Again if there are any reasons, then document them in the changelog. > > > > > > I think it makes sense. It could be explained, but it also follows > > > established madvise semantics, and I'm not sure it's necessarily > > > Minchan's job to re-iterate those. > > > > > > Sharing isn't exactly transparent to userspace. The kernel does COW, > > > ksm etc. When you madvise, you can really only speak for your own > > > reference to that memory - "*I* am not using this." > > > > > > This is in line with other madvise calls: MADV_DONTNEED clears the > > > local page table entries and drops the corresponding references, so > > > shared pages won't get freed. 
MADV_FREE clears the pte dirty bit and > > > also has explicit mapcount checks before clearing PG_dirty, so again > > > shared pages don't get freed. > > > > Right, being consistent with other madvise syscalls is certainly a way > > to go. And I am not pushing one way or another, I just want this to be > > documented with a reasoning behind. Consistency is certainly an argument > > to use. > > > > On the other hand these non-destructive madvise operations are quite > > different and the shared policy might differ as a result as well. We are > > aging objects rather than destroying them after all. Being able to age > > a pagecache with a sufficient privileges sounds like a useful usecase to > > me. In other words you are able to cause the same effect indirectly > > without the madvise operation so it kinda makes sense to allow it in a > > more sophisticated way. > > Right, I don't think it's about permission - as you say, you can do > this indirectly. Page reclaim is all about relative page order, so if > we thwarted you from demoting some pages, you could instead promote > other pages to cause a similar end result. > > I think it's about intent. You're advising the kernel that *you're* > not using this memory and would like to have it cleared out based on > that knowledge. You could do the same by simply allocating the new > pages and have the kernel sort it out. However, if the kernel sorts it > out, it *will* look at other users of the page, and it might decide > that other pages are actually colder when considering all users. > > When you ignore shared state, on the other hand, the pages you advise > out could refault right after. And then, not only did you not free up > the memory, but you also caused IO that may interfere with bringing in > the new data for which you tried to create room in the first place. > > So I don't think it ever makes sense to override it. 
>
> But it might be better to drop the explicit mapcount check and instead
> make the local pte young and call shrink_page_list() without the
                     ^
                     old?

> TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> to handle references and shared pages exactly the same way it would if
> those pages came fresh off the LRU tail, excluding only the reference
> from the mapping that we're madvising.

You are confused by the name change. Here, MADV_COLD is deactivating,
not paging out. Therefore, shrink_page_list doesn't matter.
And madvise_cold_pte_range already makes the local pte *old* (I guess
what you wrote was a typo).
I guess that's exactly what Michal wanted: just removing the page_mapcount
check and deferring the decision to the normal page reclaim policy.
If I didn't miss your intention, it seems you and Michal are on the same
page. (Please correct me if you meant something else.)
I could drop the page_mapcount check in the next revision.

Thanks for the review!
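For readers who want to try the hint from userspace, here is a minimal sketch that is not part of the posted series. It assumes a Linux system, and the helper names hint_range_cold() and cold_hint_preserves_data() are made up for illustration. The fallback value 20 is what mainline kernels (5.4 and later) use for MADV_COLD, whereas the patch quoted in this thread proposes 5, so the constant is an assumption that depends on the kernel you run. On kernels that do not know the hint, madvise() is expected to fail with EINVAL, which the sketch treats as harmless because the call is only advice.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for madvise() and MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Assumption: 20 is the mainline value (Linux 5.4+). The patch under
 * review in this thread used 5 instead.
 */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/*
 * Hypothetical helper: hint that [addr, addr + len) is cold.
 * Returns 0 on success or the errno value reported by madvise().
 */
int hint_range_cold(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_COLD) == 0)
		return 0;
	return errno;
}

/*
 * Hypothetical helper: fill an anonymous private mapping, hint it cold,
 * then check that the contents are still there. Returns 1 if the data
 * survived, 0 if it did not, and -1 if the mapping could not be created.
 */
int cold_hint_preserves_data(size_t len)
{
	unsigned char *buf;
	size_t i;
	int intact = 1;

	assert(len > 0);	/* mmap() rejects zero-length mappings */

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0xA5, len);

	/* The hint is advisory, so a failure (e.g. EINVAL) is ignored. */
	(void)hint_range_cold(buf, len);

	for (i = 0; i < len; i++) {
		if (buf[i] != 0xA5) {
			intact = 0;
			break;
		}
	}

	munmap(buf, len);
	return intact;
}
```

Calling cold_hint_preserves_data() with a few pages' worth of length should report that the data is intact whether or not the running kernel implements the hint, which matches the "data should be preserved" wording of the changelog.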
On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote: > On Fri 31-05-19 23:34:07, Minchan Kim wrote: > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote: > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote: > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > > > > When a process expects no accesses to a certain memory range, it could > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > > > > happens but data should be preserved for future use. This could reduce > > > > > > workingset eviction so it ends up increasing performance. > > > > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > > > > to be used in the near future. The hint can help kernel in deciding which > > > > > > pages to evict early during memory pressure. > > > > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > > > > head if the page is private because inactive list could be full of > > > > > > used-once pages which are first candidate for the reclaiming and that's a > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > > > > if the memory pressure happens, they will be reclaimed earlier than other > > > > > > active pages unless there is no access until the time. > > > > > > > > > > [I am intentionally not looking at the implementation because below > > > > > points should be clear from the changelog - sorry about nagging ;)] > > > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed. > > > > > Private/shared? If shared, are there any restrictions? > > > > > > > > Both file and private pages could be deactived from each active LRU > > > > to each inactive LRU if the page has one map_count. 
In other words,
> > > >
> > > > if (page_mapcount(page) <= 1)
> > > > 	deactivate_page(page);
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because page table in one of process shared the page would have access bit
> > so finally we couldn't reclaim the page. The more process it is shared,
> > the more fail to reclaim.
>
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.

Based on the review, I will remove that check and defer the decision to
the VM's normal reclaim policy.

>
> [...]
>
> > > Please document this, if this is really a desirable semantic because
> > > then you have the same set of problems as we've had with the early
> > > MADV_FREE implementation mentioned above.
> >
> > IIRC, the problem of MADV_FREE was that we couldn't discard freeable
> > pages because VM never scan anonymous LRU with swapless system.
> > However, it's not the our case because we should reclaim them, not
> > discarding.
>
> Right. But there is still the page cache reclaim. Is it expected that
> an explicitly cold memory doesn't get reclaimed because we have a
> sufficient amount of page cache (a very common case) and we never age
> anonymous memory because of that?

If there are lots of used-once pages in file-LRU, I think there is no
need to reclaim anonymous pages because it needs bigger overhead due to
IO. It has been true for a long time in current VM policy.

The reclaim preference model for these hints, ordered by cost, is as
follows:

    MADV_DONTNEED >> MADV_PAGEOUT > used-once pages > MADV_FREE >= MADV_COLD

It is desirable for the new hints to be placed in the reclaim
preference order such that a) they don't overlap functionally with
existing hints and b) we have a balanced ordering of disruptive and
non-disruptive hints.
On Mon 03-06-19 17:50:59, Johannes Weiner wrote: > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote: > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote: > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote: > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote: > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote: > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote: > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > > > > > > > When a process expects no accesses to a certain memory range, it could > > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > > > > > > > happens but data should be preserved for future use. This could reduce > > > > > > > > > workingset eviction so it ends up increasing performance. > > > > > > > > > > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > > > > > > > to be used in the near future. The hint can help kernel in deciding which > > > > > > > > > pages to evict early during memory pressure. > > > > > > > > > > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > > > > > > > head if the page is private because inactive list could be full of > > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a > > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other > > > > > > > > > active pages unless there is no access until the time. 
> > > > > > > > > > > > > > > > [I am intentionally not looking at the implementation because below > > > > > > > > points should be clear from the changelog - sorry about nagging ;)] > > > > > > > > > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed. > > > > > > > > Private/shared? If shared, are there any restrictions? > > > > > > > > > > > > > > Both file and private pages could be deactived from each active LRU > > > > > > > to each inactive LRU if the page has one map_count. In other words, > > > > > > > > > > > > > > if (page_mapcount(page) <= 1) > > > > > > > deactivate_page(page); > > > > > > > > > > > > Why do we restrict to pages that are single mapped? > > > > > > > > > > Because page table in one of process shared the page would have access bit > > > > > so finally we couldn't reclaim the page. The more process it is shared, > > > > > the more fail to reclaim. > > > > > > > > So what? In other words why should it be restricted solely based on the > > > > map count. I can see a reason to restrict based on the access > > > > permissions because we do not want to simplify all sorts of side channel > > > > attacks but memory reclaim is capable of reclaiming shared pages and so > > > > far I haven't heard any sound argument why madvise should skip those. > > > > Again if there are any reasons, then document them in the changelog. > > > > > > I think it makes sense. It could be explained, but it also follows > > > established madvise semantics, and I'm not sure it's necessarily > > > Minchan's job to re-iterate those. > > > > > > Sharing isn't exactly transparent to userspace. The kernel does COW, > > > ksm etc. When you madvise, you can really only speak for your own > > > reference to that memory - "*I* am not using this." > > > > > > This is in line with other madvise calls: MADV_DONTNEED clears the > > > local page table entries and drops the corresponding references, so > > > shared pages won't get freed. 
MADV_FREE clears the pte dirty bit and > > > also has explicit mapcount checks before clearing PG_dirty, so again > > > shared pages don't get freed. > > > > Right, being consistent with other madvise syscalls is certainly a way > > to go. And I am not pushing one way or another, I just want this to be > > documented with a reasoning behind. Consistency is certainly an argument > > to use. > > > > On the other hand these non-destructive madvise operations are quite > > different and the shared policy might differ as a result as well. We are > > aging objects rather than destroying them after all. Being able to age > > a pagecache with a sufficient privileges sounds like a useful usecase to > > me. In other words you are able to cause the same effect indirectly > > without the madvise operation so it kinda makes sense to allow it in a > > more sophisticated way. > > Right, I don't think it's about permission - as you say, you can do > this indirectly. Page reclaim is all about relative page order, so if > we thwarted you from demoting some pages, you could instead promote > other pages to cause a similar end result. There is one notable difference. If we allow an easy way to demote a shared resource _easily_ then we have to think about potential side channel attacks. Sure you can generate a memory pressure to cause the same but that is much harder and impractical in many cases. > I think it's about intent. You're advising the kernel that *you're* > not using this memory and would like to have it cleared out based on > that knowledge. You could do the same by simply allocating the new > pages and have the kernel sort it out. However, if the kernel sorts it > out, it *will* look at other users of the page, and it might decide > that other pages are actually colder when considering all users. > > When you ignore shared state, on the other hand, the pages you advise > out could refault right after. 
And then, not only did you not free up
> the memory, but you also caused IO that may interfere with bringing in
> the new data for which you tried to create room in the first place.

That is a fair argument and I would tend to agree. On the other hand we
are talking about potential usecases which tend to _know_ what they are
doing and removing the possibility completely sounds like they will not
exploit the existing interface to the maximum. But as already mentioned
starting simpler and more restricted is usually a better choice when the
semantic is not carved in stone from the very beginning and documented
that way.

> So I don't think it ever makes sense to override it.
>
> But it might be better to drop the explicit mapcount check and instead
> make the local pte young and call shrink_page_list() without the
> TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> to handle references and shared pages exactly the same way it would if
> those pages came fresh off the LRU tail, excluding only the reference
> from the mapping that we're madvising.

Yeah that makes sense to me.
On Tue 04-06-19 08:02:05, Minchan Kim wrote: > Hi Johannes, > > On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote: > > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote: > > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote: > > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote: > > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote: > > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote: > > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote: > > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > > > > > > > > When a process expects no accesses to a certain memory range, it could > > > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > > > > > > > > happens but data should be preserved for future use. This could reduce > > > > > > > > > > workingset eviction so it ends up increasing performance. > > > > > > > > > > > > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > > > > > > > > to be used in the near future. The hint can help kernel in deciding which > > > > > > > > > > pages to evict early during memory pressure. > > > > > > > > > > > > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > > > > > > > > head if the page is private because inactive list could be full of > > > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a > > > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other > > > > > > > > > > active pages unless there is no access until the time. 
> > > > > > > > > > > > > > > > > > [I am intentionally not looking at the implementation because below > > > > > > > > > points should be clear from the changelog - sorry about nagging ;)] > > > > > > > > > > > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed. > > > > > > > > > Private/shared? If shared, are there any restrictions? > > > > > > > > > > > > > > > > Both file and private pages could be deactived from each active LRU > > > > > > > > to each inactive LRU if the page has one map_count. In other words, > > > > > > > > > > > > > > > > if (page_mapcount(page) <= 1) > > > > > > > > deactivate_page(page); > > > > > > > > > > > > > > Why do we restrict to pages that are single mapped? > > > > > > > > > > > > Because page table in one of process shared the page would have access bit > > > > > > so finally we couldn't reclaim the page. The more process it is shared, > > > > > > the more fail to reclaim. > > > > > > > > > > So what? In other words why should it be restricted solely based on the > > > > > map count. I can see a reason to restrict based on the access > > > > > permissions because we do not want to simplify all sorts of side channel > > > > > attacks but memory reclaim is capable of reclaiming shared pages and so > > > > > far I haven't heard any sound argument why madvise should skip those. > > > > > Again if there are any reasons, then document them in the changelog. > > > > > > > > I think it makes sense. It could be explained, but it also follows > > > > established madvise semantics, and I'm not sure it's necessarily > > > > Minchan's job to re-iterate those. > > > > > > > > Sharing isn't exactly transparent to userspace. The kernel does COW, > > > > ksm etc. When you madvise, you can really only speak for your own > > > > reference to that memory - "*I* am not using this." 
> > > > > > > > This is in line with other madvise calls: MADV_DONTNEED clears the > > > > local page table entries and drops the corresponding references, so > > > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and > > > > also has explicit mapcount checks before clearing PG_dirty, so again > > > > shared pages don't get freed. > > > > > > Right, being consistent with other madvise syscalls is certainly a way > > > to go. And I am not pushing one way or another, I just want this to be > > > documented with a reasoning behind. Consistency is certainly an argument > > > to use. > > > > > > On the other hand these non-destructive madvise operations are quite > > > different and the shared policy might differ as a result as well. We are > > > aging objects rather than destroying them after all. Being able to age > > > a pagecache with a sufficient privileges sounds like a useful usecase to > > > me. In other words you are able to cause the same effect indirectly > > > without the madvise operation so it kinda makes sense to allow it in a > > > more sophisticated way. > > > > Right, I don't think it's about permission - as you say, you can do > > this indirectly. Page reclaim is all about relative page order, so if > > we thwarted you from demoting some pages, you could instead promote > > other pages to cause a similar end result. > > > > I think it's about intent. You're advising the kernel that *you're* > > not using this memory and would like to have it cleared out based on > > that knowledge. You could do the same by simply allocating the new > > pages and have the kernel sort it out. However, if the kernel sorts it > > out, it *will* look at other users of the page, and it might decide > > that other pages are actually colder when considering all users. > > > > When you ignore shared state, on the other hand, the pages you advise > > out could refault right after. 
And then, not only did you not free up
> > the memory, but you also caused IO that may interfere with bringing in
> > the new data for which you tried to create room in the first place.
> >
> > So I don't think it ever makes sense to override it.
> >
> > But it might be better to drop the explicit mapcount check and instead
> > make the local pte young and call shrink_page_list() without the
>                      ^
>                      old?
>
> > TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> > to handle references and shared pages exactly the same way it would if
> > those pages came fresh off the LRU tail, excluding only the reference
> > from the mapping that we're madvising.
>
> You are confused by the name change. Here, MADV_COLD is deactivating,
> not paging out. Therefore, shrink_page_list doesn't matter.
> And madvise_cold_pte_range already makes the local pte *old* (I guess
> what you wrote was a typo).
> I guess that's exactly what Michal wanted: just removing the page_mapcount
> check and deferring the decision to the normal page reclaim policy.
> If I didn't miss your intention, it seems you and Michal are on the same
> page. (Please correct me if you meant something else.)

Indeed.

> I could drop the page_mapcount check in the next revision.

Yes please.
On Tue 04-06-19 13:26:51, Minchan Kim wrote:
> On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
[...]
> > Right. But there is still the page cache reclaim. Is it expected that
> > an explicitly cold memory doesn't get reclaimed because we have a
> > sufficient amount of page cache (a very common case) and we never age
> > anonymous memory because of that?
>
> If there are lots of used-once pages in file-LRU, I think there is no
> need to reclaim anonymous pages because it needs bigger overhead due to
> IO. It has been true for a long time in current VM policy.

You are making an assumption which is not universally true. If I _know_
that there is a considerable amount of idle anonymous memory then I
would really prefer if it goes to the swap rather than make a pressure
on caching. Inactive list is not guaranteed to contain only used-once
pages, right?

Anyway, as already mentioned, we can start with a simpler implementation
for now and explicitly note that pagecache biased reclaim is known to be
a problem potentially. I am pretty sure somebody will come sooner or
later and we can address the problem then with some good numbers to back
the additional complexity.
On Tue, Jun 04, 2019 at 08:02:05AM +0900, Minchan Kim wrote: > Hi Johannes, > > On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote: > > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote: > > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote: > > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote: > > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote: > > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote: > > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote: > > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote: > > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote: > > > > > > > > > > When a process expects no accesses to a certain memory range, it could > > > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure > > > > > > > > > > happens but data should be preserved for future use. This could reduce > > > > > > > > > > workingset eviction so it ends up increasing performance. > > > > > > > > > > > > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall. > > > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected > > > > > > > > > > to be used in the near future. The hint can help kernel in deciding which > > > > > > > > > > pages to evict early during memory pressure. > > > > > > > > > > > > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's > > > > > > > > > > head if the page is private because inactive list could be full of > > > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a > > > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore, > > > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other > > > > > > > > > > active pages unless there is no access until the time. 
> > > > > > > > > > > > > > > > > > [I am intentionally not looking at the implementation because below > > > > > > > > > points should be clear from the changelog - sorry about nagging ;)] > > > > > > > > > > > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed. > > > > > > > > > Private/shared? If shared, are there any restrictions? > > > > > > > > > > > > > > > > Both file and private pages could be deactived from each active LRU > > > > > > > > to each inactive LRU if the page has one map_count. In other words, > > > > > > > > > > > > > > > > if (page_mapcount(page) <= 1) > > > > > > > > deactivate_page(page); > > > > > > > > > > > > > > Why do we restrict to pages that are single mapped? > > > > > > > > > > > > Because page table in one of process shared the page would have access bit > > > > > > so finally we couldn't reclaim the page. The more process it is shared, > > > > > > the more fail to reclaim. > > > > > > > > > > So what? In other words why should it be restricted solely based on the > > > > > map count. I can see a reason to restrict based on the access > > > > > permissions because we do not want to simplify all sorts of side channel > > > > > attacks but memory reclaim is capable of reclaiming shared pages and so > > > > > far I haven't heard any sound argument why madvise should skip those. > > > > > Again if there are any reasons, then document them in the changelog. > > > > > > > > I think it makes sense. It could be explained, but it also follows > > > > established madvise semantics, and I'm not sure it's necessarily > > > > Minchan's job to re-iterate those. > > > > > > > > Sharing isn't exactly transparent to userspace. The kernel does COW, > > > > ksm etc. When you madvise, you can really only speak for your own > > > > reference to that memory - "*I* am not using this." 
> > > > > > > > This is in line with other madvise calls: MADV_DONTNEED clears the > > > > local page table entries and drops the corresponding references, so > > > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and > > > > also has explicit mapcount checks before clearing PG_dirty, so again > > > > shared pages don't get freed. > > > > > > Right, being consistent with other madvise syscalls is certainly a way > > > to go. And I am not pushing one way or another, I just want this to be > > > documented with a reasoning behind. Consistency is certainly an argument > > > to use. > > > > > > On the other hand these non-destructive madvise operations are quite > > > different and the shared policy might differ as a result as well. We are > > > aging objects rather than destroying them after all. Being able to age > > > a pagecache with a sufficient privileges sounds like a useful usecase to > > > me. In other words you are able to cause the same effect indirectly > > > without the madvise operation so it kinda makes sense to allow it in a > > > more sophisticated way. > > > > Right, I don't think it's about permission - as you say, you can do > > this indirectly. Page reclaim is all about relative page order, so if > > we thwarted you from demoting some pages, you could instead promote > > other pages to cause a similar end result. > > > > I think it's about intent. You're advising the kernel that *you're* > > not using this memory and would like to have it cleared out based on > > that knowledge. You could do the same by simply allocating the new > > pages and have the kernel sort it out. However, if the kernel sorts it > > out, it *will* look at other users of the page, and it might decide > > that other pages are actually colder when considering all users. > > > > When you ignore shared state, on the other hand, the pages you advise > > out could refault right after. 
And then, not only did you not free up
> > the memory, but you also caused IO that may interfere with bringing in
> > the new data for which you tried to create room in the first place.
> >
> > So I don't think it ever makes sense to override it.
> >
> > But it might be better to drop the explicit mapcount check and instead
> > make the local pte young and call shrink_page_list() without the
>                      ^
>                      old?

Ah yes, of course. Clear the reference bit.

> > TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> > to handle references and shared pages exactly the same way it would if
> > those pages came fresh off the LRU tail, excluding only the reference
> > from the mapping that we're madvising.
>
> You are confused by the name change. Here, MADV_COLD is deactivating,
> not paging out. Therefore, shrink_page_list doesn't matter.
> And madvise_cold_pte_range already makes the local pte *old* (I guess
> what you wrote was a typo).
> I guess that's exactly what Michal wanted: just removing the page_mapcount
> check and deferring the decision to the normal page reclaim policy.
> If I didn't miss your intention, it seems you and Michal are on the same
> page. (Please correct me if you meant something else.)
> I could drop the page_mapcount check in the next revision.

Sorry, I was actually talking about the MADV_PAGEOUT patch in this
case, since this turned into a general discussion about how shared
pages should be handled, which applies to both operations.

My argument was for removing the check in both patches, yes, but to
additionally change the pageout patch to 1) make the advised pte old
and then 2) call shrink_page_list() WITHOUT ignore_access/references
so that it respects references from other mappings, if any.
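The two-step proposal above can be modelled in a few lines of userspace C. This is only a toy under stated assumptions: struct toy_page, toy_clear_local_reference() and the other names are hypothetical, each mapping is reduced to a single accessed bit, and real reclaim (shrink_page_list() and the rmap walk behind it) does far more. The sketch contrasts the mapcount check from the posted patch with leaving the decision to a reference check across all mappings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_MAPPINGS 8

/* Hypothetical stand-in for a page and the ptes that map it. */
struct toy_page {
	size_t nr_mappings;			/* plays the role of page_mapcount() */
	bool accessed[TOY_MAX_MAPPINGS];	/* one "young" bit per mapping */
};

/* Policy in the posted patch: only touch pages mapped at most once. */
bool toy_mapcount_policy_allows(const struct toy_page *page)
{
	return page->nr_mappings <= 1;
}

/* Step 1 of the proposal: make the advising mapping's pte old. */
void toy_clear_local_reference(struct toy_page *page, size_t self)
{
	assert(self < page->nr_mappings);
	page->accessed[self] = false;
}

/*
 * Step 2 of the proposal: reclaim honours whatever references remain,
 * so a page still referenced through another mapping is kept.
 */
bool toy_reclaim_would_evict(const struct toy_page *page)
{
	size_t i;

	assert(page->nr_mappings <= TOY_MAX_MAPPINGS);
	for (i = 0; i < page->nr_mappings; i++) {
		if (page->accessed[i])
			return false;
	}
	return true;
}
```

For a page mapped twice with both accessed bits set, the mapcount policy refuses to act at all. Under the proposal, clearing only the caller's bit leaves the page in place because the other mapping still references it, and the page becomes evictable in this model only once that second reference is gone too.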
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
 TESTPAGEFLAG(Young, young, PF_ANY)
 SETPAGEFLAG(Young, young, PF_ANY)
 TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
 	SetPageYoung(page);
 }
 
+static inline void clear_page_young(struct page *page)
+{
+	ClearPageYoung(page);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
 	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }
 
+static inline void clear_page_young(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext))
+		return;
+
+	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index bea0278f65ab..1190f4e7f7b9 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_COLD	5		/* deactivate these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..bff150eab6da 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_COLD:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		if (is_huge_zero_pmd(*pmd))
+			goto huge_unlock;
+
+		page = pmd_page(*pmd);
+		if (page_mapcount(page) > 1)
+			goto huge_unlock;
+
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		pmdp_test_and_clear_young(vma, addr, pmd);
+		deactivate_page(page);
+huge_unlock:
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+regular_page:
+	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (pte_none(ptent))
+			continue;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (page_mapcount(page) > 1)
+			continue;
+
+		ptep_test_and_clear_young(vma, addr, pte);
+		deactivate_page(page);
+	}
+
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk cool_walk = {
+		.pmd_entry = madvise_cold_pte_range,
+		.mm = vma->vm_mm,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &cool_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cold(struct vm_area_struct *vma,
+			struct vm_area_struct **prev,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	*prev = vma;
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_COLD:
+		return madvise_cold(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_FREE:
+	case MADV_COLD:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/swap.c b/mm/swap.c
index 7b079976cbec..cebedab15aa2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		clear_page_young(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
@@ -590,6 +608,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -623,6 +645,26 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page. This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		get_page(page);
+		if (!pagevec_add(pvec, page) || PageCompound(page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 /**
  * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
@@ -687,6 +729,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
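
The list movement done by lru_deactivate_fn() in the diff above (remove the page from the active list, then add it at the head of the matching inactive list) can be mimicked with a userspace toy model. This is a sketch only: `struct toy_lru` and the `toy_*` functions are hypothetical names, the lists are plain arrays with index 0 as the head, and per-CPU pagevec batching and page flags are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_LRU_MAX 8

/* Toy model, not kernel code: index 0 is the head of each list. */
struct toy_lru {
	int pages[TOY_LRU_MAX];
	size_t len;
};

/* Remove @page from @lru; returns false if it was not on the list. */
static bool toy_lru_del(struct toy_lru *lru, int page)
{
	for (size_t i = 0; i < lru->len; i++) {
		if (lru->pages[i] != page)
			continue;
		memmove(&lru->pages[i], &lru->pages[i + 1],
			(lru->len - i - 1) * sizeof(lru->pages[0]));
		lru->len--;
		return true;
	}
	return false;
}

/* Insert @page at the head of @lru; the caller guarantees there is room. */
static void toy_lru_add_head(struct toy_lru *lru, int page)
{
	assert(lru->len < TOY_LRU_MAX);
	memmove(&lru->pages[1], &lru->pages[0],
		lru->len * sizeof(lru->pages[0]));
	lru->pages[0] = page;
	lru->len++;
}

/* Mirror of the deactivation step: active list -> head of inactive list. */
static bool toy_deactivate(struct toy_lru *active, struct toy_lru *inactive,
			   int page)
{
	if (inactive->len == TOY_LRU_MAX)
		return false;	/* toy limit only; leave the page in place */
	if (!toy_lru_del(active, page))
		return false;	/* not on the active list: nothing to do */
	toy_lru_add_head(inactive, page);
	return true;
}
```

Because the deactivated page lands at the head, pages that were already on the inactive list sit closer to the tail and remain the earlier reclaim candidates, which matches the ordering described in the changelog.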
When a process expects no accesses to a certain memory range, it could
give a hint to the kernel that the pages can be reclaimed when memory
pressure happens but data should be preserved for future use. This could
reduce workingset eviction so it ends up increasing performance.

This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help the kernel in deciding
which pages to evict early during memory pressure.

Internally, it works via deactivating pages from the active list to the
inactive list's head if the page is private, because the inactive list
could be full of used-once pages which are the first candidates for
reclaim, and that's the reason why MADV_FREE moves pages to the head of
the inactive LRU list. Therefore, if memory pressure happens, they will be
reclaimed earlier than other active pages, provided they are not accessed
in the meantime.

* RFCv1
 * renaming from MADV_COOL to MADV_COLD - hannes

* internal review
 * use clear_page_young in deactivate_page - joelaf
 * Revise the description - surenb
 * Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/page-flags.h             |   1 +
 include/linux/page_idle.h              |  15 ++++
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 111 +++++++++++++++++++++++++
 mm/swap.c                              |  43 ++++++++++
 6 files changed, 172 insertions(+)
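
From userspace, the hint described in this changelog would be issued through the ordinary madvise(2) call. The following is a minimal sketch under a few assumptions: `hint_cold()` and `cold_hint_preserves_data()` are hypothetical helper names, and the code uses whatever `MADV_COLD` value `<sys/mman.h>` provides (which may differ from the value 5 used in this RFC) rather than hard-coding one. If the headers do not define it, the hint is simply skipped.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hint that [addr, addr + len) is cold.
 * Returns 0 on success, 1 if the hint is unavailable or was rejected for
 * this range (the patch returns -EINVAL for locked, hugetlb and PFN-mapped
 * VMAs), and -1 on any other error.
 */
static int hint_cold(void *addr, size_t len)
{
#ifdef MADV_COLD
	if (madvise(addr, len, MADV_COLD) == 0)
		return 0;
	/* Kernels without the hint reject the unknown advice value. */
	if (errno == EINVAL || errno == ENOSYS)
		return 1;
	return -1;
#else
	(void)addr;
	(void)len;
	return 1;	/* headers predate MADV_COLD */
#endif
}

/* Demo: fill an anonymous mapping, hint it cold, check the data survived. */
static bool cold_hint_preserves_data(void)
{
	const size_t len = 16 * 4096;
	unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	bool intact;

	if (buf == MAP_FAILED)
		return false;
	memset(buf, 0xAB, len);
	(void)hint_cold(buf, len);	/* advisory only; the result is optional */
	intact = buf[0] == 0xAB && buf[len - 1] == 0xAB;
	munmap(buf, len);
	return intact;
}
```

The demo reflects the non-destructive intent stated above: unlike MADV_DONTNEED, the contents must still be readable after the hint, whether or not the kernel acted on it.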