Message ID: 20240417160842.76665-1-ryncsn@gmail.com (mailing list archive)
Series: mm/swap: optimize swap cache search space
Hi, Kairui,

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Currently we use one swap_address_space for every 64M chunk to reduce lock
> contention, this is like having a set of smaller swap files inside one
> big swap file. But when doing swap cache look up or insert, we are
> still using the offset of the whole large swap file. This is OK for
> correctness, as the offset (key) is unique.
>
> But Xarray is specially optimized for small indexes, it creates the
> redix tree levels lazily to be just enough to fit the largest key
> stored in one Xarray. So we are wasting tree nodes unnecessarily.
>
> For 64M chunk it should only take at most 3 level to contain everything.
> But we are using the offset from the whole swap file, so the offset (key)
> value will be way beyond 64M, and so will the tree level.
>
> Optimize this by reduce the swap cache search space into 64M scope.

In general, I think that it makes sense to reduce the depth of the
xarray.

One concern is that IIUC we make swap cache behaves like file cache if
possible. And your change makes swap cache and file cache diverge more.
Is it possible for us to keep them similar?

For example,

Is it possible to return the offset inside 64M range in
__page_file_index() (maybe rename it)?

Is it possible to add "start_offset" support in xarray, so "index"
will subtract "start_offset" before looking up / inserting?

Is it possible to use multiple range locks to protect one xarray to
improve the lock scalability? This is why we have multiple "struct
address_space" for one swap device. And, we may have same lock
contention issue for large files too.

I haven't look at the code in details. So, my idea may not make sense
at all. If so, sorry about that.

Hi, Matthew,

Can you teach me on this too?

--
Best Regards,
Huang, Ying
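The chunking the cover letter describes can be sketched in plain C. SWAP_ADDRESS_SPACE_SHIFT really is 14 in the kernel (16384 4K pages per 64M chunk), and an xarray node covers XA_CHUNK_SHIFT = 6 bits of index; the helper names and the standalone depth model below are illustrative approximations of what the series does, not the kernel code itself.

```c
#include <assert.h>

/*
 * Userspace sketch of the 64M-chunk split, assuming 4K pages:
 * 64M / 4K = 16384 slots per chunk, i.e. a 14-bit in-chunk index.
 */
#define SWAP_ADDRESS_SPACE_SHIFT 14
#define SWAP_ADDRESS_SPACE_PAGES (1UL << SWAP_ADDRESS_SPACE_SHIFT)
#define SWAP_ADDRESS_SPACE_MASK  (SWAP_ADDRESS_SPACE_PAGES - 1)

/* Which per-64M struct address_space a swap offset maps to. */
static unsigned long swap_cache_space(unsigned long offset)
{
	return offset >> SWAP_ADDRESS_SPACE_SHIFT;
}

/* Key inside that space's xarray after the series: the low 14 bits
 * only, so one tree never spans more than 16384 indexes. */
static unsigned long swap_cache_index(unsigned long offset)
{
	return offset & SWAP_ADDRESS_SPACE_MASK;
}

/* An xarray node indexes XA_CHUNK_SHIFT = 6 bits, and levels are
 * grown lazily to fit the largest stored key. */
static unsigned int xa_levels(unsigned long max_index)
{
	unsigned int levels = 1;

	while (max_index >> 6) {
		max_index >>= 6;
		levels++;
	}
	return levels;
}
```

For example, a 64G swap device has 2^24 page slots: keyed by the full offset, the top chunk's tree needs xa_levels(2^24 - 1) = 4 levels, while keyed by the in-chunk index it needs xa_levels(16383) = 3 — the "at most 3 level" figure from the cover letter.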
On Mon, Apr 22, 2024 at 3:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Kairui,
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently we use one swap_address_space for every 64M chunk to reduce lock
> > contention, this is like having a set of smaller swap files inside one
> > big swap file. But when doing swap cache look up or insert, we are
> > still using the offset of the whole large swap file. This is OK for
> > correctness, as the offset (key) is unique.
> >
> > But Xarray is specially optimized for small indexes, it creates the
> > redix tree levels lazily to be just enough to fit the largest key
> > stored in one Xarray. So we are wasting tree nodes unnecessarily.
> >
> > For 64M chunk it should only take at most 3 level to contain everything.
> > But we are using the offset from the whole swap file, so the offset (key)
> > value will be way beyond 64M, and so will the tree level.
> >
> > Optimize this by reduce the swap cache search space into 64M scope.

Hi,

Thanks for the comments!

> In general, I think that it makes sense to reduce the depth of the
> xarray.
>
> One concern is that IIUC we make swap cache behaves like file cache if
> possible. And your change makes swap cache and file cache diverge more.
> Is it possible for us to keep them similar?

So far in this series, I think there is no problem for that, the two
main helpers for retrieving file & cache offset: folio_index and
folio_file_pos will work fine and be compatible with current users.

And if we convert to share filemap_* functions for swap cache / page
cache, they are mostly already accepting index as an argument so no
trouble at all.

>
> For example,
>
> Is it possible to return the offset inside 64M range in
> __page_file_index() (maybe rename it)?

Not sure what you mean by this, __page_file_index will be gone as we
convert to folio.

And this series did delete / rename it (it might not be easy to see
this, the usage of these helpers is not very well organized before
this series so some clean up is involved).

It was previously only used through page_index (deleted) /
folio_index, and, now folio_index will be returning the offset inside
the 64M range.

I guess I just did what you wanted? :)

My cover letter and commit message might be not clear enough, I can update it.

>
> Is it possible to add "start_offset" support in xarray, so "index"
> will subtract "start_offset" before looking up / inserting?

xarray struct seems already very full, and this usage doesn't look
generic to me, might be better to fix this kind of issue case by case.

>
> Is it possible to use multiple range locks to protect one xarray to
> improve the lock scalability? This is why we have multiple "struct
> address_space" for one swap device. And, we may have same lock
> contention issue for large files too.

Good question, this series can improve the tree depth issue for swap
cache, but contention in address space is still a thing.

A more generic solution might involve changes of xarray API or use
some other data struct?

(BTW I think reducing the search space and resolving lock contention
is not necessarily related, reducing the search space by having a
large table of small trees should still perform better for swap
cache).

>
> I haven't look at the code in details. So, my idea may not make sense
> at all. If so, sorry about that.
>
> Hi, Matthew,
>
> Can you teach me on this too?
>
> --
> Best Regards,
> Huang, Ying
Kairui Song <ryncsn@gmail.com> writes:

> On Mon, Apr 22, 2024 at 3:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Kairui,
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > Currently we use one swap_address_space for every 64M chunk to reduce lock
>> > contention, this is like having a set of smaller swap files inside one
>> > big swap file. But when doing swap cache look up or insert, we are
>> > still using the offset of the whole large swap file. This is OK for
>> > correctness, as the offset (key) is unique.
>> >
>> > But Xarray is specially optimized for small indexes, it creates the
>> > redix tree levels lazily to be just enough to fit the largest key
>> > stored in one Xarray. So we are wasting tree nodes unnecessarily.
>> >
>> > For 64M chunk it should only take at most 3 level to contain everything.
>> > But we are using the offset from the whole swap file, so the offset (key)
>> > value will be way beyond 64M, and so will the tree level.
>> >
>> > Optimize this by reduce the swap cache search space into 64M scope.
>
> Hi,
>
> Thanks for the comments!
>
>> In general, I think that it makes sense to reduce the depth of the
>> xarray.
>>
>> One concern is that IIUC we make swap cache behaves like file cache if
>> possible. And your change makes swap cache and file cache diverge more.
>> Is it possible for us to keep them similar?
>
> So far in this series, I think there is no problem for that, the two
> main helpers for retrieving file & cache offset: folio_index and
> folio_file_pos will work fine and be compatible with current users.
>
> And if we convert to share filemap_* functions for swap cache / page
> cache, they are mostly already accepting index as an argument so no
> trouble at all.
>
>>
>> For example,
>>
>> Is it possible to return the offset inside 64M range in
>> __page_file_index() (maybe rename it)?
>
> Not sure what you mean by this, __page_file_index will be gone as we
> convert to folio.
> And this series did delete / rename it (it might not be easy to see
> this, the usage of these helpers is not very well organized before
> this series so some clean up is involved).
> It was previously only used through page_index (deleted) /
> folio_index, and, now folio_index will be returning the offset inside
> the 64M range.
>
> I guess I just did what you wanted? :)

Good!

> My cover letter and commit message might be not clear enough, I can update it.
>
>>
>> Is it possible to add "start_offset" support in xarray, so "index"
>> will subtract "start_offset" before looking up / inserting?
>
> xarray struct seems already very full, and this usage doesn't look
> generic to me, might be better to fix this kind of issue case by case.

Just some open question.

>>
>> Is it possible to use multiple range locks to protect one xarray to
>> improve the lock scalability? This is why we have multiple "struct
>> address_space" for one swap device. And, we may have same lock
>> contention issue for large files too.
>
> Good question, this series can improve the tree depth issue for swap
> cache, but contention in address space is still a thing.

The lock contention for swap cache has been reduced via using multiple
xarray in commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
trunks"). But it fixes that for swap cache only, not for file cache in
general. We have observed similar lock contention issue for file cache
too. And the method isn't perfect too, like the issue you found here.
In general, it's about what is "file" for swap device.

> A more generic solution might involve changes of xarray API or use
> some other data struct?
>
> (BTW I think reducing the search space and resolving lock contention
> is not necessarily related, reducing the search space by having a
> large table of small trees should still perform better for swap
> cache).

--
Best Regards,
Huang, Ying
On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> Is it possible to add "start_offset" support in xarray, so "index"
> will subtract "start_offset" before looking up / inserting?

We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
generalise it, but then we'd have to store that somewhere and there's
no obvious good place to store it that wouldn't enlarge struct xarray,
which I'd be reluctant to do.

> Is it possible to use multiple range locks to protect one xarray to
> improve the lock scalability? This is why we have multiple "struct
> address_space" for one swap device. And, we may have same lock
> contention issue for large files too.

It's something I've considered. The issue is search marks. If we delete
an entry, we may have to walk all the way up the xarray clearing bits as
we go and I'd rather not grab a lock at each level. There's a convenient
4 byte hole between nr_values and parent where we could put it.

Oh, another issue is that we use i_pages.xa_lock to synchronise
address_space.nrpages, so I'm not sure that a per-node lock will help.

But I'm conscious that there are workloads which show contention on
xa_lock as their limiting factor, so I'm open to ideas to improve all
these things.
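The "start_offset" idea being discussed amounts to keeping a bias outside the tree and subtracting it before every lookup or insert. A minimal userspace sketch, with a plain array standing in for the xarray, might look like the following — nothing like this exists in the kernel's xarray API, and Matthew's point above is precisely that struct xarray has no good place to store such a bias:

```c
#include <assert.h>

/* Hypothetical biased store: callers keep using full swap-file
 * offsets; the structure subtracts start_offset so the underlying
 * tree only ever sees small keys (and so stays shallow). */
#define CHUNK_SLOTS 16384UL

struct biased_store {
	unsigned long start_offset;   /* first full-file offset we cover */
	void *slots[CHUNK_SLOTS];     /* stand-in for the real tree */
};

static unsigned long biased_key(const struct biased_store *s,
				unsigned long index)
{
	assert(index >= s->start_offset &&
	       index < s->start_offset + CHUNK_SLOTS);
	return index - s->start_offset;   /* small key, shallow tree */
}

static void biased_insert(struct biased_store *s, unsigned long index,
			  void *p)
{
	s->slots[biased_key(s, index)] = p;
}

static void *biased_lookup(struct biased_store *s, unsigned long index)
{
	return s->slots[biased_key(s, index)];
}
```

The swap cache series reaches the same effect without touching the xarray: the bias is implicit in which of the per-64M address spaces is chosen, and the masked offset is the small key.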
Hi, Matthew,

Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
>> Is it possible to add "start_offset" support in xarray, so "index"
>> will subtract "start_offset" before looking up / inserting?
>
> We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> generalise it, but then we'd have to store that somewhere and there's
> no obvious good place to store it that wouldn't enlarge struct xarray,
> which I'd be reluctant to do.
>
>> Is it possible to use multiple range locks to protect one xarray to
>> improve the lock scalability? This is why we have multiple "struct
>> address_space" for one swap device. And, we may have same lock
>> contention issue for large files too.
>
> It's something I've considered. The issue is search marks. If we delete
> an entry, we may have to walk all the way up the xarray clearing bits as
> we go and I'd rather not grab a lock at each level. There's a convenient
> 4 byte hole between nr_values and parent where we could put it.
>
> Oh, another issue is that we use i_pages.xa_lock to synchronise
> address_space.nrpages, so I'm not sure that a per-node lock will help.

Thanks for looking at this.

> But I'm conscious that there are workloads which show contention on
> xa_lock as their limiting factor, so I'm open to ideas to improve all
> these things.

I have no idea so far because my very limited knowledge about xarray.

--
Best Regards,
Huang, Ying
Hi Ying,

On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Matthew,
>
> Matthew Wilcox <willy@infradead.org> writes:
>
> > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> >> Is it possible to add "start_offset" support in xarray, so "index"
> >> will subtract "start_offset" before looking up / inserting?
> >
> > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> > generalise it, but then we'd have to store that somewhere and there's
> > no obvious good place to store it that wouldn't enlarge struct xarray,
> > which I'd be reluctant to do.
> >
> >> Is it possible to use multiple range locks to protect one xarray to
> >> improve the lock scalability? This is why we have multiple "struct
> >> address_space" for one swap device. And, we may have same lock
> >> contention issue for large files too.
> >
> > It's something I've considered. The issue is search marks. If we delete
> > an entry, we may have to walk all the way up the xarray clearing bits as
> > we go and I'd rather not grab a lock at each level. There's a convenient
> > 4 byte hole between nr_values and parent where we could put it.
> >
> > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > address_space.nrpages, so I'm not sure that a per-node lock will help.
>
> Thanks for looking at this.
>
> > But I'm conscious that there are workloads which show contention on
> > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > these things.
>
> I have no idea so far because my very limited knowledge about xarray.

For the swap file usage, I have been considering an idea to remove the
index part of the xarray from swap cache. Swap cache is different from
file cache in a few aspects.

For one if we want to have a folio equivalent of "large swap entry".
Then the natural alignment of those swap offset on does not make
sense. Ideally we should be able to write the folio to un-aligned swap
file locations.

The other aspect for swap files is that, we already have different
data structures organized around swap offset, swap_map and
swap_cgroup. If we group the swap related data structure together. We
can add a pointer to a union of folio or a shadow swap entry. We can
use atomic updates on the swap struct member or breakdown the access
lock by ranges just like swap cluster does.

I want to discuss those ideas in the upcoming LSF/MM meet up as well.

Chris
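The alignment restriction behind "un-aligned swap file locations" comes from how multi-order entries sit in an xarray: an entry of order N spans 2^N consecutive indexes and, as I understand the xarray API, must start at an order-aligned index. A minimal illustration of that rule:

```c
#include <assert.h>
#include <stdbool.h>

/* Keyed by swap offset, a large folio's cache entry can only start at
 * an index aligned to its order — e.g. an order-9 (2M with 4K pages)
 * folio needs a 512-slot-aligned swap offset. */
static bool valid_multi_index(unsigned long index, unsigned int order)
{
	return (index & ((1UL << order) - 1)) == 0;
}
```

So a 2M folio could occupy slots starting at offset 512 but not at 520; dropping the xarray-keyed cache, as Chris proposes, is what would lift this constraint.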
Chris Li <chrisl@kernel.org> writes:

> Hi Ying,
>
> On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Matthew,
>>
>> Matthew Wilcox <willy@infradead.org> writes:
>>
>> > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
>> >> Is it possible to add "start_offset" support in xarray, so "index"
>> >> will subtract "start_offset" before looking up / inserting?
>> >
>> > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
>> > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
>> > generalise it, but then we'd have to store that somewhere and there's
>> > no obvious good place to store it that wouldn't enlarge struct xarray,
>> > which I'd be reluctant to do.
>> >
>> >> Is it possible to use multiple range locks to protect one xarray to
>> >> improve the lock scalability? This is why we have multiple "struct
>> >> address_space" for one swap device. And, we may have same lock
>> >> contention issue for large files too.
>> >
>> > It's something I've considered. The issue is search marks. If we delete
>> > an entry, we may have to walk all the way up the xarray clearing bits as
>> > we go and I'd rather not grab a lock at each level. There's a convenient
>> > 4 byte hole between nr_values and parent where we could put it.
>> >
>> > Oh, another issue is that we use i_pages.xa_lock to synchronise
>> > address_space.nrpages, so I'm not sure that a per-node lock will help.
>>
>> Thanks for looking at this.
>>
>> > But I'm conscious that there are workloads which show contention on
>> > xa_lock as their limiting factor, so I'm open to ideas to improve all
>> > these things.
>>
>> I have no idea so far because my very limited knowledge about xarray.
>
> For the swap file usage, I have been considering an idea to remove the
> index part of the xarray from swap cache. Swap cache is different from
> file cache in a few aspects.
> For one if we want to have a folio equivalent of "large swap entry".
> Then the natural alignment of those swap offset on does not make
> sense. Ideally we should be able to write the folio to un-aligned swap
> file locations.
>
> The other aspect for swap files is that, we already have different
> data structures organized around swap offset, swap_map and
> swap_cgroup. If we group the swap related data structure together. We
> can add a pointer to a union of folio or a shadow swap entry.

The shadow swap entry may be freed. So we need to prepare for that.
And, in current design, only swap_map[] is allocated if the swap space
isn't used. That needs to be considered too.

> We can use atomic updates on the swap struct member or breakdown the
> access lock by ranges just like swap cluster does.

The swap code uses xarray in a simple way. That gives us opportunity to
optimize. For example, it makes it easy to use multiple xarray
instances for one swap device.

> I want to discuss those ideas in the upcoming LSF/MM meet up as well.

Good!

--
Best Regards,
Huang, Ying
On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > Hi Ying,
> >
> > For the swap file usage, I have been considering an idea to remove the
> > index part of the xarray from swap cache. Swap cache is different from
> > file cache in a few aspects.
> > For one if we want to have a folio equivalent of "large swap entry".
> > Then the natural alignment of those swap offset on does not make
> > sense. Ideally we should be able to write the folio to un-aligned swap
> > file locations.
> >
> > The other aspect for swap files is that, we already have different
> > data structures organized around swap offset, swap_map and
> > swap_cgroup. If we group the swap related data structure together. We
> > can add a pointer to a union of folio or a shadow swap entry.
>
> The shadow swap entry may be freed. So we need to prepare for that.

Free the shadow swap entry will just set the pointer to NULL.
Are you concerned that the memory allocated for the pointer is not
free to the system after the shadow swap entry is free?

It will be subject to fragmentation on the free swap entry.
In that regard, xarray is also subject to fragmentation. It will not
free the internal node if the node has one xa_index not freed. Even if
the xarray node is freed to slab, at slab level there is fragmentation
as well, the backing page might not free to the system.

> And, in current design, only swap_map[] is allocated if the swap space
> isn't used. That needs to be considered too.

I am aware of that. I want to make the swap_map[] not static allocated
any more either.

The swap_map static allocation forces the rest of the swap data
structure to have other means to sparsely allocate their data
structure, repeating the fragmentation elsewhere, in different
ways. That is also the one major source of the pain point hacking on
the swap code. The data structure is spread into too many different
places.

> > We can use atomic updates on the swap struct member or breakdown the
> > access lock by ranges just like swap cluster does.
>
> The swap code uses xarray in a simple way. That gives us opportunity to
> optimize. For example, it makes it easy to use multiple xarray

The fixed swap offset range makes it like an array. There are many
ways to shard the array like swap entry, e.g. swap cluster is one way
to shard it. Multiple xarray is another way. We can also do multiple
xarray like sharding, or even more fancy ones.

Chris
Chris Li <chrisl@kernel.org> writes:

> On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > Hi Ying,
>> >
>> > For the swap file usage, I have been considering an idea to remove the
>> > index part of the xarray from swap cache. Swap cache is different from
>> > file cache in a few aspects.
>> > For one if we want to have a folio equivalent of "large swap entry".
>> > Then the natural alignment of those swap offset on does not make
>> > sense. Ideally we should be able to write the folio to un-aligned swap
>> > file locations.
>> >
>> > The other aspect for swap files is that, we already have different
>> > data structures organized around swap offset, swap_map and
>> > swap_cgroup. If we group the swap related data structure together. We
>> > can add a pointer to a union of folio or a shadow swap entry.
>>
>> The shadow swap entry may be freed. So we need to prepare for that.
>
> Free the shadow swap entry will just set the pointer to NULL.
> Are you concerned that the memory allocated for the pointer is not
> free to the system after the shadow swap entry is free?
>
> It will be subject to fragmentation on the free swap entry.
> In that regard, xarray is also subject to fragmentation. It will not
> free the internal node if the node has one xa_index not freed. Even if
> the xarray node is freed to slab, at slab level there is fragmentation
> as well, the backing page might not free to the system.

Sorry my words were confusing. What I wanted to say is that the xarray
node may be freed.

>> And, in current design, only swap_map[] is allocated if the swap space
>> isn't used. That needs to be considered too.
>
> I am aware of that. I want to make the swap_map[] not static allocated
> any more either.

Yes. That's possible.

> The swap_map static allocation forces the rest of the swap data
> structure to have other means to sparsely allocate their data
> structure, repeating the fragmentation elsewhere, in different
> ways.That is also the one major source of the pain point hacking on
> the swap code. The data structure is spread into too many different
> places.

Look forward to more details to compare :-)

>> > We can use atomic updates on the swap struct member or breakdown the
>> > access lock by ranges just like swap cluster does.
>>
>> The swap code uses xarray in a simple way. That gives us opportunity to
>> optimize. For example, it makes it easy to use multiple xarray
>
> The fixed swap offset range makes it like an array. There are many
> ways to shard the array like swap entry, e.g. swap cluster is one way
> to shard it. Multiple xarray is another way. We can also do multiple
> xarray like sharding, or even more fancy ones.

--
Best Regards,
Huang, Ying
On Sat, Apr 27, 2024 at 8:23 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
>
> > Free the shadow swap entry will just set the pointer to NULL.
> > Are you concerned that the memory allocated for the pointer is not
> > free to the system after the shadow swap entry is free?
> >
> > It will be subject to fragmentation on the free swap entry.
> > In that regard, xarray is also subject to fragmentation. It will not
> > free the internal node if the node has one xa_index not freed. Even if
> > the xarray node is freed to slab, at slab level there is fragmentation
> > as well, the backing page might not free to the system.
>
> Sorry my words were confusing. What I wanted to say is that the xarray
> node may be freed.

Somehow I get that is what you mean :-) My previous reply still
applies here. The xarray node freeing will be subject to the
fragmentation at slab level. The actual backing page might not release
to the kernel after the node freeing.

>
> >> And, in current design, only swap_map[] is allocated if the swap space
> >> isn't used. That needs to be considered too.
> >
> > I am aware of that. I want to make the swap_map[] not static allocated
> > any more either.
>
> Yes. That's possible.

Of course there will be a price to pay for that. The current swap_map
is only 1 byte per entry. That swap map count size per swap entry is
going to be hard to beat in the alternatives. Hopefully find the trade
off in other places.

>
> > The swap_map static allocation forces the rest of the swap data
> > structure to have other means to sparsely allocate their data
> > structure, repeating the fragmentation elsewhere, in different
> > ways.That is also the one major source of the pain point hacking on
> > the swap code. The data structure is spread into too many different
> > places.
>
> Look forward to more details to compare :-)

Sure. When I make more progress I will post it.

Chris
On Sat, Apr 27, 2024 at 7:16 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Ying,
>
> On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Hi, Matthew,
> >
> > Matthew Wilcox <willy@infradead.org> writes:
> >
> > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> > >> Is it possible to add "start_offset" support in xarray, so "index"
> > >> will subtract "start_offset" before looking up / inserting?
> > >
> > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> > > generalise it, but then we'd have to store that somewhere and there's
> > > no obvious good place to store it that wouldn't enlarge struct xarray,
> > > which I'd be reluctant to do.
> > >
> > >> Is it possible to use multiple range locks to protect one xarray to
> > >> improve the lock scalability? This is why we have multiple "struct
> > >> address_space" for one swap device. And, we may have same lock
> > >> contention issue for large files too.
> > >
> > > It's something I've considered. The issue is search marks. If we delete
> > > an entry, we may have to walk all the way up the xarray clearing bits as
> > > we go and I'd rather not grab a lock at each level. There's a convenient
> > > 4 byte hole between nr_values and parent where we could put it.
> > >
> > > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > > address_space.nrpages, so I'm not sure that a per-node lock will help.
> >
> > Thanks for looking at this.
> >
> > > But I'm conscious that there are workloads which show contention on
> > > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > > these things.
> >
> > I have no idea so far because my very limited knowledge about xarray.
>
> For the swap file usage, I have been considering an idea to remove the
> index part of the xarray from swap cache. Swap cache is different from
> file cache in a few aspects.
> For one if we want to have a folio equivalent of "large swap entry".
> Then the natural alignment of those swap offset on does not make
> sense. Ideally we should be able to write the folio to un-aligned swap
> file locations.
>

Hi Chris,

This sound interesting, I have a few questions though...

Are you suggesting we handle swap on file and swap on device
differently? Swap on file is much less frequently used than swap on
device I think.

And what do you mean "index part of the xarray"? If we need a cache,
xarray still seems one of the best choices to hold the content.

> The other aspect for swap files is that, we already have different
> data structures organized around swap offset, swap_map and
> swap_cgroup. If we group the swap related data structure together. We
> can add a pointer to a union of folio or a shadow swap entry. We can
> use atomic updates on the swap struct member or breakdown the access
> lock by ranges just like swap cluster does.
>
> I want to discuss those ideas in the upcoming LSF/MM meet up as well.

Looking forward to it!

>
> Chris
On Mon, Apr 29, 2024 at 1:37 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Apr 27, 2024 at 7:16 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Ying,
> >
> > On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >
> > > Hi, Matthew,
> > >
> > > Matthew Wilcox <willy@infradead.org> writes:
> > >
> > > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> > > >> Is it possible to add "start_offset" support in xarray, so "index"
> > > >> will subtract "start_offset" before looking up / inserting?
> > > >
> > > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > > > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could
> > > > generalise it, but then we'd have to store that somewhere and there's
> > > > no obvious good place to store it that wouldn't enlarge struct xarray,
> > > > which I'd be reluctant to do.
> > > >
> > > >> Is it possible to use multiple range locks to protect one xarray to
> > > >> improve the lock scalability? This is why we have multiple "struct
> > > >> address_space" for one swap device. And, we may have same lock
> > > >> contention issue for large files too.
> > > >
> > > > It's something I've considered. The issue is search marks. If we delete
> > > > an entry, we may have to walk all the way up the xarray clearing bits as
> > > > we go and I'd rather not grab a lock at each level. There's a convenient
> > > > 4 byte hole between nr_values and parent where we could put it.
> > > >
> > > > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > > > address_space.nrpages, so I'm not sure that a per-node lock will help.
> > >
> > > Thanks for looking at this.
> > >
> > > > But I'm conscious that there are workloads which show contention on
> > > > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > > > these things.
> > >
> > > I have no idea so far because my very limited knowledge about xarray.
> >
> > For the swap file usage, I have been considering an idea to remove the
> > index part of the xarray from swap cache. Swap cache is different from
> > file cache in a few aspects.
> > For one if we want to have a folio equivalent of "large swap entry".
> > Then the natural alignment of those swap offset on does not make
> > sense. Ideally we should be able to write the folio to un-aligned swap
> > file locations.
> >
>
> Hi Chris,
>
> This sound interesting, I have a few questions though...
>
> Are you suggesting we handle swap on file and swap on device
> differently? Swap on file is much less frequently used than swap on
> device I think.
>
> And what do you mean "index part of the xarray"? If we need a cache,
> xarray still seems one of the best choices to hold the content.
>
> > The other aspect for swap files is that, we already have different
> > data structures organized around swap offset, swap_map and
> > swap_cgroup. If we group the swap related data structure together. We
> > can add a pointer to a union of folio or a shadow swap entry. We can
> > use atomic updates on the swap struct member or breakdown the access
> > lock by ranges just like swap cluster does.

Oh, and BTW I'm also trying to breakdown the swap address space range
(from 64M to 16M, SWAP_ADDRESS_SPACE_SHIFT from 14 to 12). It's a
simple approach, but the coupling and increased memory usage of
address_space structure makes the performance go into regression
(about -2% for worst real world workload). I found this part very
performance sensitive, so basically I'm not making much progress for
the future items I mentioned in this cover letter. New ideas could be
very helpful!

> >
> > I want to discuss those ideas in the upcoming LSF/MM meet up as well.
>
> Looking forward to it!
>
> >
> > Chris
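The 64M-to-16M experiment Kairui describes trades smaller trees for more struct address_space instances. A back-of-the-envelope model of that trade-off, assuming 4K pages (the real sizeof(struct address_space) and lock footprint vary by config, so only the counts below are meaningful):

```c
#include <assert.h>

/* Number of per-chunk struct address_space instances needed for a
 * swap device, given SWAP_ADDRESS_SPACE_SHIFT (14 = 64M chunks,
 * 12 = 16M chunks, with 4K pages). */
#define PAGE_SHIFT 12

static unsigned long nr_swap_spaces(unsigned long swap_bytes,
				    unsigned int space_shift)
{
	unsigned long slots = swap_bytes >> PAGE_SHIFT;

	/* one address_space (and one xa_lock) per 2^space_shift slots */
	return (slots + (1UL << space_shift) - 1) >> space_shift;
}
```

For a 64G swap device this gives 1024 spaces at shift 14 but 4096 at shift 12 — 4x the address_space metadata and lock objects, which is consistent with the extra memory usage and cache pressure behind the ~2% regression mentioned above.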
On Sun, Apr 28, 2024 at 10:37 AM Kairui Song <ryncsn@gmail.com> wrote: > > On Sat, Apr 27, 2024 at 7:16 AM Chris Li <chrisl@kernel.org> wrote: > > > > Hi Ying, > > > > On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote: > > > > > > Hi, Matthew, > > > > > > Matthew Wilcox <willy@infradead.org> writes: > > > > > > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote: > > > >> Is it possible to add "start_offset" support in xarray, so "index" > > > >> will subtract "start_offset" before looking up / inserting? > > > > > > > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for > > > > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We could > > > > generalise it, but then we'd have to store that somewhere and there's > > > > no obvious good place to store it that wouldn't enlarge struct xarray, > > > > which I'd be reluctant to do. > > > > > > > >> Is it possible to use multiple range locks to protect one xarray to > > > >> improve the lock scalability? This is why we have multiple "struct > > > >> address_space" for one swap device. And, we may have same lock > > > >> contention issue for large files too. > > > > > > > > It's something I've considered. The issue is search marks. If we delete > > > > an entry, we may have to walk all the way up the xarray clearing bits as > > > > we go and I'd rather not grab a lock at each level. There's a convenient > > > > 4 byte hole between nr_values and parent where we could put it. > > > > > > > > Oh, another issue is that we use i_pages.xa_lock to synchronise > > > > address_space.nrpages, so I'm not sure that a per-node lock will help. > > > > > > Thanks for looking at this. > > > > > > > But I'm conscious that there are workloads which show contention on > > > > xa_lock as their limiting factor, so I'm open to ideas to improve all > > > > these things. > > > > > > I have no idea so far because my very limited knowledge about xarray. 
> > > > For the swap file usage, I have been considering an idea to remove the > > index part of the xarray from swap cache. Swap cache is different from > > file cache in a few aspects. > > For one, if we want to have a folio equivalent of "large swap entry", > > then the natural alignment of those swap offsets does not make > > sense. Ideally we should be able to write the folio to un-aligned swap > > file locations. > > > > Hi Chris, > > This sounds interesting, I have a few questions though... > > Are you suggesting we handle swap on file and swap on device > differently? Swap on file is much less frequently used than swap on > device, I think. That is not what I have in mind. The swap struct idea does not distinguish between swap file and swap device. BTW, I sometimes use swap on a file because I did not allocate a swap partition in advance. > > And what do you mean by "index part of the xarray"? If we need a cache, > xarray still seems one of the best choices to hold the content. We still need to look up swap file offset -> folio. However, if we allocate each swap offset a "struct swap", then the folio lookup can be as simple as getting the struct swap by offset, then an atomic read of swap_struct->folio. Not sure how you came to the conclusion of "best choices"? It is one choice, but it has its drawbacks. The natural alignment requirement of xarray (e.g. 2M large swap entries need to be written to a 2M-aligned offset) is an unnecessary restriction. If we allocate the "struct swap" ourselves, we have more flexibility. > > The other aspect for swap files is that we already have different > > data structures organized around swap offset, swap_map and > > swap_cgroup. If we group the swap related data structures together, we > > can add a pointer to a union of folio or a shadow swap entry. We can > > use atomic updates on the swap struct members or break down the access > > lock by ranges just like the swap cluster does.
> > > > I want to discuss those ideas in the upcoming LSF/MM meet up as well. > > Looking forward to it! Thanks, I will post more when I make more progress on that. > > Oh, and BTW, I'm also trying to break down the swap address space range > (from 64M to 16M, SWAP_ADDRESS_SPACE_SHIFT from 14 to > 12). It's a simple approach, but the coupling and increased memory > usage of the address_space structure makes the performance go into > regression (about -2% for the worst real-world workload). I found this Yes, that sounds plausible. > part very performance sensitive, so basically I'm not making much > progress on the future items I mentioned in this cover letter. New > ideas could be very helpful! > The swap_struct idea is very different from what you are trying to do in this series. It is more related to my LSF/MM topic on the swap back end overhaul, a longer-term and bigger undertaking. Chris
From: Kairui Song <kasong@tencent.com>

Currently we use one swap_address_space for every 64M chunk to reduce
lock contention; this is like having a set of smaller swap files inside
one big swap file. But when doing a swap cache lookup or insert, we are
still using the offset into the whole large swap file. This is OK for
correctness, as the offset (key) is unique.

But Xarray is specially optimized for small indexes: it creates the
radix tree levels lazily, just deep enough to fit the largest key
stored in one Xarray. So we are wasting tree nodes unnecessarily.

For a 64M chunk it should take at most 3 levels to contain everything.
But we are using the offset from the whole swap file, so the offset
(key) value will be way beyond 64M, and so will the tree depth.

Optimize this by reducing the swap cache search space to a 64M scope.

Tested with `time memhog 128G` inside an 8G memcg using 128G swap
(ramdisk with SWP_SYNCHRONOUS_IO dropped; tested 3 times, results are
stable). The test result is similar but the improvement is smaller if
SWP_SYNCHRONOUS_IO is enabled, as the swap-out path can never skip the
swap cache:

Before:
6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
0inputs+0outputs (55major+33555018minor)pagefaults 0swaps

After (+1.8% faster):
6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
0inputs+0outputs (54major+33555027minor)pagefaults 0swaps

Similar result with MySQL and sysbench using swap:

Before: 94055.61 qps
After (+0.8% faster): 94834.91 qps

There is also a very slight drop in radix tree node slab usage:

Before: 303952K
After: 302224K

For this series:

There are multiple places that expect mixed types of pages (page cache
or swap cache), e.g.
migration and huge memory splitting. There are four helpers for that:

- page_index
- page_file_offset
- folio_index
- folio_file_pos

So this series first cleans up the usage of page_index and
page_file_offset, then converts folio_index and folio_file_pos to be
compatible with separate offsets, and introduces a new helper,
swap_cache_index, for swap-internal usage, replacing swp_offset with
swap_cache_index when it is used to retrieve a folio from the swap
cache.

Ideally, we may want to reduce SWAP_ADDRESS_SPACE_SHIFT from 14 to 12:
the default Xarray chunk shift is 6, so we have 3-level trees instead
of 2-level trees just for 2 extra bits. But the swap cache is based on
the address_space struct, and with 4 times more metadata sparsely
distributed in memory, it wastes more cachelines and the performance
gain from this series is almost canceled. So for now, just have a
cleaner separation of offsets.

Patch 1/8 - 6/8: Clean up usage of page_index and page_file_offset
Patch 7/8: Convert folio_index and folio_file_pos to be compatible with
separate offsets.
Patch 8/8: Introduce swap_cache_index and use it when doing lookups in
the swap cache.
This series is part of the effort to reduce swap cache overhead, and
ultimately remove SWP_SYNCHRONOUS_IO and unify swap cache usage as
proposed before:
https://lore.kernel.org/lkml/20240326185032.72159-1-ryncsn@gmail.com/

Kairui Song (8):
  NFS: remove nfs_page_lengthg and usage of page_index
  nilfs2: drop usage of page_index
  f2fs: drop usage of page_index
  ceph: drop usage of page_index
  cifs: drop usage of page_file_offset
  mm/swap: get the swap file offset directly
  mm: drop page_index/page_file_offset and convert swap helpers to use folio
  mm/swap: reduce swap cache search space

 fs/ceph/dir.c           |  2 +-
 fs/ceph/inode.c         |  2 +-
 fs/f2fs/data.c          |  5 ++---
 fs/nfs/internal.h       | 19 -------------------
 fs/nilfs2/bmap.c        |  3 +--
 fs/smb/client/file.c    |  2 +-
 include/linux/mm.h      | 13 -------------
 include/linux/pagemap.h | 19 +++++++++----------
 mm/huge_memory.c        |  2 +-
 mm/memcontrol.c         |  2 +-
 mm/mincore.c            |  2 +-
 mm/page_io.c            |  6 +++---
 mm/shmem.c              |  2 +-
 mm/swap.h               | 12 ++++++++++++
 mm/swap_state.c         | 12 ++++++------
 mm/swapfile.c           | 17 +++++++++++------
 16 files changed, 51 insertions(+), 69 deletions(-)