Message ID | 20240117-zswap-xarray-v1-1-6daa86c08fae@kernel.org (mailing list archive)
---|---
State | New
Series | RFC: zswap tree use xarray instead of RB tree
On Wed, Jan 17, 2024 at 7:06 PM Chris Li <chrisl@kernel.org> wrote:
>
> The xarray tree is added alongside the zswap RB tree.
> Checks for the xarray get the same result as the RB tree operations.
>
> Rename the zswap RB tree function to a more generic function
> name without the RB part.

As I mentioned in the cover letter, I believe this should be squashed
into the second patch. I have some comments below as well on the parts
that should remain after the squash.

[..]
>
> @@ -462,9 +463,9 @@ static void zswap_lru_putback(struct list_lru *list_lru,
>  /*********************************
>  * rbtree functions
>  **********************************/
> -static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> +static struct zswap_entry *zswap_search(struct zswap_tree *tree, pgoff_t offset)

Let's change the zswap_rb_* prefixes to zswap_tree_* instead of just
zswap_*. Otherwise, it will be confusing to have both zswap_store and
zswap_insert (as well as zswap_load and zswap_search).

[..]
>
> @@ -1790,15 +1808,21 @@ void zswap_swapon(int type)
>  void zswap_swapoff(int type)
>  {
>         struct zswap_tree *tree = zswap_trees[type];
> -       struct zswap_entry *entry, *n;
> +       struct zswap_entry *entry, *e, *n;
> +       XA_STATE(xas, tree ? &tree->xarray : NULL, 0);
>
>         if (!tree)
>                 return;
>
>         /* walk the tree and free everything */
>         spin_lock(&tree->lock);
> +
> +       xas_for_each(&xas, e, ULONG_MAX)

Why not use xa_for_each?

> +               zswap_invalidate_entry(tree, e);
> +
>         rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode)
> -               zswap_free_entry(entry);

Replacing zswap_free_entry() with zswap_invalidate_entry() is a
behavioral change that should be done separately from this series, but I
am wondering why it's needed. IIUC, the swapoff code should be making
sure there are no ongoing swapin/swapout operations, and there are no
pages left in zswap to write back.

Is it the case that swapoff may race with writeback, such that
writeback is holding the last remaining ref after zswap_invalidate()
is called, and then zswap_swapoff() is called, freeing the zswap entry
while writeback is still accessing it?
On Wed, Jan 17, 2024 at 10:20:29PM -0800, Yosry Ahmed wrote:
> >         /* walk the tree and free everything */
> >         spin_lock(&tree->lock);
> > +
> > +       xas_for_each(&xas, e, ULONG_MAX)
>
> Why not use xa_for_each?

xas_for_each() is O(n) while xa_for_each() is O(n log n), as mentioned
in the fine documentation. If you don't need to drop the lock while
in the body of the loop, always prefer xas_for_each().
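For readers less familiar with the xarray API, here is a minimal sketch of the two iteration styles being compared. It is not part of the patch; the function names and the use of opaque pointers are illustrative only. The point is that xa_for_each() restarts the tree walk on every iteration (hence the extra log n factor), while xas_for_each() keeps its position in the tree and therefore needs the caller to hold the xa_lock or the RCU read lock for the whole loop.

#include <linux/printk.h>
#include <linux/xarray.h>

/* Illustrative only: walk an xarray of opaque pointers both ways. */
static void walk_with_xa_for_each(struct xarray *xa)
{
	unsigned long index;
	void *entry;

	/* No caller locking required; each step restarts the tree walk. */
	xa_for_each(xa, index, entry)
		pr_info("index %lu -> %p\n", index, entry);
}

static void walk_with_xas_for_each(struct xarray *xa)
{
	XA_STATE(xas, xa, 0);
	void *entry;

	/* Caller must hold the xa_lock (or the RCU read lock) across the loop. */
	xas_lock(&xas);
	xas_for_each(&xas, entry, ULONG_MAX)
		pr_info("index %lu -> %p\n", xas.xa_index, entry);
	xas_unlock(&xas);
}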
On Thu, Jan 18, 2024 at 5:52 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jan 17, 2024 at 10:20:29PM -0800, Yosry Ahmed wrote:
> > >         /* walk the tree and free everything */
> > >         spin_lock(&tree->lock);
> > > +
> > > +       xas_for_each(&xas, e, ULONG_MAX)
> >
> > Why not use xa_for_each?
>
> xas_for_each() is O(n) while xa_for_each() is O(n log n), as mentioned
> in the fine documentation. If you don't need to drop the lock while
> in the body of the loop, always prefer xas_for_each().

Thanks for pointing this out. Out of ignorance, I skipped reading the
doc for this one and operated under the general assumption to use the
xa_* functions where possible.

The doc also says we should hold either the RCU read lock or the
xa_lock while iterating, but we are not doing either here, no?
On Thu, Jan 18, 2024 at 08:59:55AM -0800, Yosry Ahmed wrote:
> On Thu, Jan 18, 2024 at 5:52 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Jan 17, 2024 at 10:20:29PM -0800, Yosry Ahmed wrote:
> > > >         /* walk the tree and free everything */
> > > >         spin_lock(&tree->lock);
> > > > +
> > > > +       xas_for_each(&xas, e, ULONG_MAX)
> > >
> > > Why not use xa_for_each?
> >
> > xas_for_each() is O(n) while xa_for_each() is O(n log n), as mentioned
> > in the fine documentation. If you don't need to drop the lock while
> > in the body of the loop, always prefer xas_for_each().
>
> Thanks for pointing this out. Out of ignorance, I skipped reading the
> doc for this one and operated under the general assumption to use the
> xa_* functions where possible.
>
> The doc also says we should hold either the RCU read lock or the
> xa_lock while iterating, but we are not doing either here, no?

I have no idea; I haven't studied the patches in detail yet. I have
debugging assertions for that, so I was assuming that Chris had been
developing with debug options turned on. If not, I guess the bots will
do it for him.
Hi Yosry,

On Wed, Jan 17, 2024 at 10:21 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Jan 17, 2024 at 7:06 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > The xarray tree is added alongside the zswap RB tree.
> > Checks for the xarray get the same result as the RB tree operations.
> >
> > Rename the zswap RB tree function to a more generic function
> > name without the RB part.
>
> As I mentioned in the cover letter, I believe this should be squashed
> into the second patch. I have some comments below as well on the parts
> that should remain after the squash.
>
> [..]
> >
> > @@ -462,9 +463,9 @@ static void zswap_lru_putback(struct list_lru *list_lru,
> >  /*********************************
> >  * rbtree functions
> >  **********************************/
> > -static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> > +static struct zswap_entry *zswap_search(struct zswap_tree *tree, pgoff_t offset)
>
> Let's change the zswap_rb_* prefixes to zswap_tree_* instead of just
> zswap_*. Otherwise, it will be confusing to have both zswap_store and
> zswap_insert (as well as zswap_load and zswap_search).

How about zswap_xa_* ?

>
> [..]
> > @@ -1790,15 +1808,21 @@ void zswap_swapon(int type)
> >  void zswap_swapoff(int type)
> >  {
> >         struct zswap_tree *tree = zswap_trees[type];
> > -       struct zswap_entry *entry, *n;
> > +       struct zswap_entry *entry, *e, *n;
> > +       XA_STATE(xas, tree ? &tree->xarray : NULL, 0);
> >
> >         if (!tree)
> >                 return;
> >
> >         /* walk the tree and free everything */
> >         spin_lock(&tree->lock);
> > +
> > +       xas_for_each(&xas, e, ULONG_MAX)
>
> Why not use xa_for_each?
>
> > +               zswap_invalidate_entry(tree, e);
> > +
> >         rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode)
> > -               zswap_free_entry(entry);
>
> Replacing zswap_free_entry() with zswap_invalidate_entry() is a
> behavioral change that should be done separately from this series, but I
> am wondering why it's needed. IIUC, the swapoff code should be making
> sure there are no ongoing swapin/swapout operations, and there are no
> pages left in zswap to write back.
>
> Is it the case that swapoff may race with writeback, such that
> writeback is holding the last remaining ref after zswap_invalidate()
> is called, and then zswap_swapoff() is called, freeing the zswap entry
> while writeback is still accessing it?

For the RB tree, the mapping is stored in the zswap entry itself, as the
RB node. That is different from the xarray: the xarray stores the mapping
outside of the zswap entry, so just freeing the entry does not remove the
mapping from the xarray. Therefore it needs to call zswap_invalidate_entry()
to remove the entry from the xarray. I could call zswap_erase() and then
free the entry; I just think zswap_invalidate_entry() is more consistent
with the rest of the code.

Chris
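To make the intrusive-versus-external distinction Chris describes concrete, here is a minimal sketch; the struct layout and helper names are illustrative, not the patch's code. With an RB tree the mapping lives in a node embedded in the entry, so erasing the node and freeing the entry leaves no tree state behind. With an xarray the mapping is a slot in a separate structure that points at the entry, so the slot must be erased explicitly or lookups would return freed memory.

#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/xarray.h>

/* Illustrative entry with an embedded (intrusive) RB node. */
struct demo_entry {
	struct rb_node rbnode;		/* the RB mapping lives inside the entry */
	unsigned long offset;
};

/* RB tree: erase the embedded node, then the entry owns no tree state. */
static void demo_drop_rb(struct rb_root *root, struct demo_entry *e)
{
	rb_erase(&e->rbnode, root);
	kfree(e);
}

/*
 * xarray: the mapping is external, so the slot at e->offset must be
 * erased before the entry is freed.
 */
static void demo_drop_xa(struct xarray *xa, struct demo_entry *e)
{
	void *old = xa_erase(xa, e->offset);	/* remove the external mapping */

	WARN_ON(old != e);
	kfree(e);				/* now it is safe to free */
}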
On Thu, Jan 18, 2024 at 10:25 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Jan 18, 2024 at 08:59:55AM -0800, Yosry Ahmed wrote:
> > On Thu, Jan 18, 2024 at 5:52 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Jan 17, 2024 at 10:20:29PM -0800, Yosry Ahmed wrote:
> > > > >         /* walk the tree and free everything */
> > > > >         spin_lock(&tree->lock);
> > > > > +
> > > > > +       xas_for_each(&xas, e, ULONG_MAX)
> > > >
> > > > Why not use xa_for_each?
> > >
> > > xas_for_each() is O(n) while xa_for_each() is O(n log n), as mentioned
> > > in the fine documentation. If you don't need to drop the lock while
> > > in the body of the loop, always prefer xas_for_each().
> >
> > Thanks for pointing this out. Out of ignorance, I skipped reading the
> > doc for this one and operated under the general assumption to use the
> > xa_* functions where possible.
> >
> > The doc also says we should hold either the RCU read lock or the
> > xa_lock while iterating, but we are not doing either here, no?
>
> I have no idea; I haven't studied the patches in detail yet. I have
> debugging assertions for that, so I was assuming that Chris had been
> developing with debug options turned on. If not, I guess the bots will
> do it for him.

It is fine for now because we have the extra zswap tree spinlock. Once we
remove the zswap tree spinlock, it will require the RCU read lock. You are
right that I would get an assertion failure.

Chris
My previous email got messed up, sorry.

> > > @@ -462,9 +463,9 @@ static void zswap_lru_putback(struct list_lru *list_lru,
> > >  /*********************************
> > >  * rbtree functions
> > >  **********************************/
> > > -static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> > > +static struct zswap_entry *zswap_search(struct zswap_tree *tree, pgoff_t offset)
> >
> > Let's change the zswap_rb_* prefixes to zswap_tree_* instead of just
> > zswap_*. Otherwise, it will be confusing to have both zswap_store and
> > zswap_insert (as well as zswap_load and zswap_search).
>
> How about zswap_xa_* ?

SGTM.

> >
> > [..]
> > > @@ -1790,15 +1808,21 @@ void zswap_swapon(int type)
> > >  void zswap_swapoff(int type)
> > >  {
> > >         struct zswap_tree *tree = zswap_trees[type];
> > > -       struct zswap_entry *entry, *n;
> > > +       struct zswap_entry *entry, *e, *n;
> > > +       XA_STATE(xas, tree ? &tree->xarray : NULL, 0);
> > >
> > >         if (!tree)
> > >                 return;
> > >
> > >         /* walk the tree and free everything */
> > >         spin_lock(&tree->lock);
> > > +
> > > +       xas_for_each(&xas, e, ULONG_MAX)
> >
> > Why not use xa_for_each?
> >
> > > +               zswap_invalidate_entry(tree, e);
> > > +
> > >         rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode)
> > > -               zswap_free_entry(entry);
> >
> > Replacing zswap_free_entry() with zswap_invalidate_entry() is a
> > behavioral change that should be done separately from this series, but I
> > am wondering why it's needed. IIUC, the swapoff code should be making
> > sure there are no ongoing swapin/swapout operations, and there are no
> > pages left in zswap to write back.
> >
> > Is it the case that swapoff may race with writeback, such that
> > writeback is holding the last remaining ref after zswap_invalidate()
> > is called, and then zswap_swapoff() is called, freeing the zswap entry
> > while writeback is still accessing it?
>
> For the RB tree, the mapping is stored in the zswap entry itself, as the
> RB node. That is different from the xarray: the xarray stores the mapping
> outside of the zswap entry, so just freeing the entry does not remove the
> mapping from the xarray. Therefore it needs to call zswap_invalidate_entry()
> to remove the entry from the xarray. I could call zswap_erase() and then
> free the entry; I just think zswap_invalidate_entry() is more consistent
> with the rest of the code.

I see, but it's not clear to me if the xarray is being properly
cleaned up in this case.

Do we have to call xa_destroy() anyway to make sure everything is
cleaned up in the xarray? In that case, we can just do that after the
loop.
On Thu, Jan 18, 2024 at 9:28 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Jan 18, 2024 at 10:25 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Jan 18, 2024 at 08:59:55AM -0800, Yosry Ahmed wrote:
> > > On Thu, Jan 18, 2024 at 5:52 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Wed, Jan 17, 2024 at 10:20:29PM -0800, Yosry Ahmed wrote:
> > > > > >         /* walk the tree and free everything */
> > > > > >         spin_lock(&tree->lock);
> > > > > > +
> > > > > > +       xas_for_each(&xas, e, ULONG_MAX)
> > > > >
> > > > > Why not use xa_for_each?
> > > >
> > > > xas_for_each() is O(n) while xa_for_each() is O(n log n), as mentioned
> > > > in the fine documentation. If you don't need to drop the lock while
> > > > in the body of the loop, always prefer xas_for_each().
> > >
> > > Thanks for pointing this out. Out of ignorance, I skipped reading the
> > > doc for this one and operated under the general assumption to use the
> > > xa_* functions where possible.
> > >
> > > The doc also says we should hold either the RCU read lock or the
> > > xa_lock while iterating, but we are not doing either here, no?
> >
> > I have no idea; I haven't studied the patches in detail yet. I have
> > debugging assertions for that, so I was assuming that Chris had been
> > developing with debug options turned on. If not, I guess the bots will
> > do it for him.
>
> It is fine for now because we have the extra zswap tree spinlock. Once we
> remove the zswap tree spinlock, it will require the RCU read lock. You are
> right that I would get an assertion failure.

I would imagine the assertions check that we are holding either the RCU
read lock or the xa_lock, so how would holding the zswap tree lock help?
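As a point of reference, a sketch of what a read-side walk would look like once the external zswap tree spinlock is gone, per the discussion above. The function name is illustrative only. The thread suggests the xarray's debug assertions only recognize the RCU read lock or the xa_lock; holding some unrelated spinlock is invisible to them even if it serializes all writers in practice.

#include <linux/xarray.h>

static void walk_readonly_under_rcu(struct xarray *xa)
{
	XA_STATE(xas, xa, 0);
	void *entry;

	rcu_read_lock();
	xas_for_each(&xas, entry, ULONG_MAX) {
		/* read-only walk; entries may be concurrently removed */
	}
	rcu_read_unlock();
}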
On Fri, Jan 19, 2024 at 11:29:42AM -0800, Yosry Ahmed wrote:
> I see, but it's not clear to me if the xarray is being properly
> cleaned up in this case.
>
> Do we have to call xa_destroy() anyway to make sure everything is
> cleaned up in the xarray? In that case, we can just do that after the
> loop.

You do not need to call xa_destroy(). xa_destroy() exists for two
patterns: first, that you're storing values, not pointers, in the tree,
and you can just delete the tree without leaking memory; second, that
you do xas_for_each() { kfree(p); }; xa_destroy(); which is more
efficient than xas_for_each() { kfree(p); xas_store(NULL); } as it
batches the freeing of the nodes to the end.

If your code is naturally structured so that you delete the entries
after freeing them, you have no reason to call xa_destroy().
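Spelled out as code, the two shapes being contrasted look roughly like this. This is an illustrative sketch, assuming the xarray holds kmalloc()ed pointers and that nothing else can look the entries up during teardown; the function names are not from the patch.

#include <linux/slab.h>
#include <linux/xarray.h>

/* Pattern A: free the objects, then let xa_destroy() free all nodes at once. */
static void teardown_batched(struct xarray *xa)
{
	XA_STATE(xas, xa, 0);
	void *p;

	xas_lock(&xas);
	xas_for_each(&xas, p, ULONG_MAX)
		kfree(p);
	xas_unlock(&xas);

	xa_destroy(xa);			/* frees the remaining tree nodes in one pass */
}

/* Pattern B: erase each slot as you go; nothing is left for xa_destroy(). */
static void teardown_incremental(struct xarray *xa)
{
	XA_STATE(xas, xa, 0);
	void *p;

	xas_lock(&xas);
	xas_for_each(&xas, p, ULONG_MAX) {
		kfree(p);
		xas_store(&xas, NULL);	/* clears the slot at the current index */
	}
	xas_unlock(&xas);
}

Pattern A is only safe if no lookups can happen between the kfree() and the xa_destroy(), which is exactly the concern raised further down the thread.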
On Fri, Jan 19, 2024 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jan 19, 2024 at 11:29:42AM -0800, Yosry Ahmed wrote:
> > I see, but it's not clear to me if the xarray is being properly
> > cleaned up in this case.
> >
> > Do we have to call xa_destroy() anyway to make sure everything is
> > cleaned up in the xarray? In that case, we can just do that after the
> > loop.
>
> You do not need to call xa_destroy(). xa_destroy() exists for two
> patterns: first, that you're storing values, not pointers, in the tree,
> and you can just delete the tree without leaking memory; second, that
> you do xas_for_each() { kfree(p); }; xa_destroy(); which is more
> efficient than xas_for_each() { kfree(p); xas_store(NULL); } as it
> batches the freeing of the nodes to the end.
>
> If your code is naturally structured so that you delete the entries
> after freeing them, you have no reason to call xa_destroy().

Thanks for elaborating. Based on this, I believe doing xas_for_each()
{ zswap_free_entry(); }; xa_destroy(); is both closer to the current
code structure and more efficient.
Hi Yosry,

On Fri, Jan 19, 2024 at 1:41 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > If your code is naturally structured so that you delete the entries
> > after freeing them, you have no reason to call xa_destroy().
>
> Thanks for elaborating. Based on this, I believe doing xas_for_each()
> { zswap_free_entry(); }; xa_destroy(); is both closer to the current
> code structure and more efficient.

We can't do that in this case, though, because we only hold the RCU read
lock on the tree. Other threads can still look up the xarray (also under
the RCU read lock) and get a pointer to the already-freed memory.

Chris
On Fri, Jan 19, 2024 at 2:05 PM Chris Li <chriscli@google.com> wrote:
>
> Hi Yosry,
>
> On Fri, Jan 19, 2024 at 1:41 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > If your code is naturally structured so that you delete the entries
> > > after freeing them, you have no reason to call xa_destroy().
> >
> > Thanks for elaborating. Based on this, I believe doing xas_for_each()
> > { zswap_free_entry(); }; xa_destroy(); is both closer to the current
> > code structure and more efficient.
> >
> We can't do that in this case, though, because we only hold the RCU read
> lock on the tree. Other threads can still look up the xarray (also under
> the RCU read lock) and get a pointer to the already-freed memory.

During swapoff no one should be trying to swap in or swap out, so where
would the lookups come from?
diff --git a/mm/zswap.c b/mm/zswap.c
index f8bc9e089268..a40b0076722b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -235,6 +235,7 @@ struct zswap_entry {
  */
 struct zswap_tree {
         struct rb_root rbroot;
+        struct xarray xarray;
         spinlock_t lock;
 };
 
@@ -462,9 +463,9 @@ static void zswap_lru_putback(struct list_lru *list_lru,
 /*********************************
 * rbtree functions
 **********************************/
-static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
+static struct zswap_entry *zswap_search(struct zswap_tree *tree, pgoff_t offset)
 {
-        struct rb_node *node = root->rb_node;
+        struct rb_node *node = tree->rbroot.rb_node;
         struct zswap_entry *entry;
         pgoff_t entry_offset;
 
@@ -475,8 +476,12 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
                         node = node->rb_left;
                 else if (entry_offset < offset)
                         node = node->rb_right;
-                else
+                else {
+                        struct zswap_entry *e = xa_load(&tree->xarray, offset);
+
+                        BUG_ON(entry != e);
                         return entry;
+                }
         }
         return NULL;
 }
@@ -485,13 +490,15 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
  * In the case that a entry with the same offset is found, a pointer to
  * the existing entry is stored in dupentry and the function returns -EEXIST
  */
-static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
+static int zswap_insert(struct zswap_tree *tree, struct zswap_entry *entry,
                         struct zswap_entry **dupentry)
 {
+        struct rb_root *root = &tree->rbroot;
         struct rb_node **link = &root->rb_node, *parent = NULL;
-        struct zswap_entry *myentry;
+        struct zswap_entry *myentry, *old;
         pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
 
+
         while (*link) {
                 parent = *link;
                 myentry = rb_entry(parent, struct zswap_entry, rbnode);
@@ -501,19 +508,26 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
                 else if (myentry_offset < entry_offset)
                         link = &(*link)->rb_right;
                 else {
+                        old = xa_load(&tree->xarray, entry_offset);
+                        BUG_ON(old != myentry);
                         *dupentry = myentry;
                         return -EEXIST;
                 }
         }
         rb_link_node(&entry->rbnode, parent, link);
         rb_insert_color(&entry->rbnode, root);
+        old = xa_store(&tree->xarray, entry_offset, entry, GFP_KERNEL);
         return 0;
 }
 
-static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
+static bool zswap_erase(struct zswap_tree *tree, struct zswap_entry *entry)
 {
+        pgoff_t offset = swp_offset(entry->swpentry);
         if (!RB_EMPTY_NODE(&entry->rbnode)) {
-                rb_erase(&entry->rbnode, root);
+                struct zswap_entry *old;
+                old = xa_erase(&tree->xarray, offset);
+                BUG_ON(old != entry);
+                rb_erase(&entry->rbnode, &tree->rbroot);
                 RB_CLEAR_NODE(&entry->rbnode);
                 return true;
         }
@@ -575,12 +589,12 @@ static void zswap_entry_put(struct zswap_tree *tree,
 }
 
 /* caller must hold the tree lock */
-static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
+static struct zswap_entry *zswap_entry_find_get(struct zswap_tree *tree,
                                 pgoff_t offset)
 {
         struct zswap_entry *entry;
 
-        entry = zswap_rb_search(root, offset);
+        entry = zswap_search(tree, offset);
         if (entry)
                 zswap_entry_get(entry);
 
@@ -845,7 +859,7 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
 static void zswap_invalidate_entry(struct zswap_tree *tree,
                                    struct zswap_entry *entry)
 {
-        if (zswap_rb_erase(&tree->rbroot, entry))
+        if (zswap_erase(tree, entry))
                 zswap_entry_put(tree, entry);
 }
 
@@ -875,7 +889,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 
         /* Check for invalidate() race */
         spin_lock(&tree->lock);
-        if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
+        if (entry != zswap_search(tree, swpoffset))
                 goto unlock;
 
         /* Hold a reference to prevent a free during writeback */
@@ -1407,6 +1421,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
                                  struct zswap_tree *tree)
 {
         swp_entry_t swpentry = entry->swpentry;
+        pgoff_t offset = swp_offset(swpentry);
+        struct zswap_entry *e;
         struct folio *folio;
         struct mempolicy *mpol;
         bool folio_was_allocated;
@@ -1439,7 +1455,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
          * avoid overwriting a new swap folio with old compressed data.
          */
         spin_lock(&tree->lock);
-        if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
+        e = zswap_search(tree, offset);
+        if (e != entry) {
                 spin_unlock(&tree->lock);
                 delete_from_swap_cache(folio);
                 return -ENOMEM;
@@ -1528,7 +1545,7 @@ bool zswap_store(struct folio *folio)
          * the tree, and it might be written back overriding the new data.
          */
         spin_lock(&tree->lock);
-        dupentry = zswap_rb_search(&tree->rbroot, offset);
+        dupentry = zswap_search(tree, offset);
         if (dupentry) {
                 zswap_duplicate_entry++;
                 zswap_invalidate_entry(tree, dupentry);
@@ -1671,7 +1688,7 @@ bool zswap_store(struct folio *folio)
          * found again here it means that something went wrong in the swap
          * cache.
          */
-        while (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
+        while (zswap_insert(tree, entry, &dupentry) == -EEXIST) {
                 WARN_ON(1);
                 zswap_duplicate_entry++;
                 zswap_invalidate_entry(tree, dupentry);
@@ -1722,7 +1739,7 @@ bool zswap_load(struct folio *folio)
 
         /* find */
         spin_lock(&tree->lock);
-        entry = zswap_entry_find_get(&tree->rbroot, offset);
+        entry = zswap_entry_find_get(tree, offset);
         if (!entry) {
                 spin_unlock(&tree->lock);
                 return false;
@@ -1762,7 +1779,7 @@ void zswap_invalidate(int type, pgoff_t offset)
 
         /* find */
         spin_lock(&tree->lock);
-        entry = zswap_rb_search(&tree->rbroot, offset);
+        entry = zswap_search(tree, offset);
         if (!entry) {
                 /* entry was written back */
                 spin_unlock(&tree->lock);
@@ -1783,6 +1800,7 @@ void zswap_swapon(int type)
         }
 
         tree->rbroot = RB_ROOT;
+        xa_init(&tree->xarray);
         spin_lock_init(&tree->lock);
         zswap_trees[type] = tree;
 }
@@ -1790,15 +1808,21 @@ void zswap_swapon(int type)
 void zswap_swapoff(int type)
 {
         struct zswap_tree *tree = zswap_trees[type];
-        struct zswap_entry *entry, *n;
+        struct zswap_entry *entry, *e, *n;
+        XA_STATE(xas, tree ? &tree->xarray : NULL, 0);
 
         if (!tree)
                 return;
 
         /* walk the tree and free everything */
         spin_lock(&tree->lock);
+
+        xas_for_each(&xas, e, ULONG_MAX)
+                zswap_invalidate_entry(tree, e);
+
         rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode)
-                zswap_free_entry(entry);
+                BUG_ON(entry);
+
         tree->rbroot = RB_ROOT;
         spin_unlock(&tree->lock);
         kfree(tree);