Message ID | 48b349a2a0f7c76e18772712d0997a5e12ab0a3b.1728314403.git.lorenzo.stoakes@oracle.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | maple_tree: correct tree corruption on spanning store | expand |
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241007 11:28]: > There has been a subtle bug present in the maple tree implementation from > its inception. > > This arises from how stores are performed - when a store occurs, it will > overwrite overlapping ranges and adjust the tree as necessary to > accommodate this. > > A range may always ultimately span two leaf nodes. In this instance we walk > the two leaf nodes, determine which elements are not overwritten to the > left and to the right of the start and end of the ranges respectively and > then rebalance the tree to contain these entries and the newly inserted > one. > > This kind of store is dubbed a 'spanning store' and is implemented by > mas_wr_spanning_store(). > > In order to reach this stage, mas_store_gfp() invokes mas_wr_preallocate(), > mas_wr_store_type() and mas_wr_walk() in turn to walk the tree and update > the object (mas) to traverse to the location where the write should be > performed, determining its store type. > > When a spanning store is required, this function returns false stopping at > the parent node which contains the target range, and mas_wr_store_type() > marks the mas->store_type as wr_spanning_store to denote this fact. > > When we go to perform the store in mas_wr_spanning_store(), we first > determine the elements AFTER the END of the range we wish to store (that > is, to the right of the entry to be inserted) - we do this by walking to > the NEXT pivot in the tree (i.e. r_mas.last + 1), starting at the node we > have just determined contains the range over which we intend to write. > > We then turn our attention to the entries to the left of the entry we are > inserting, whose state is represented by l_mas, and copy these into a 'big > node', which is a special node which contains enough slots to contain two > leaf node's worth of data. > > We then copy the entry we wish to store immediately after this - the copy > and the insertion of the new entry is performed by mas_store_b_node(). > > After this we copy the elements to the right of the end of the range which > we are inserting, if we have not exceeded the length of the node > (i.e. r_mas.offset <= r_mas.end). > > Herein lies the bug - under very specific circumstances, this logic can > break and corrupt the maple tree. > > Consider the following tree: > > Height > 0 Root Node > / \ > pivot = 0xffff / \ pivot = ULONG_MAX > / \ > 1 A [-----] ... > / \ > pivot = 0x4fff / \ pivot = 0xffff > / \ > 2 (LEAVES) B [-----] [-----] C > ^--- Last pivot 0xffff. > > Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note > that all ranges expressed in maple tree code are inclusive): > > 1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then > determines that this is a spanning store across nodes B and C. The mas > state is set such that the current node from which we traverse further > is node A. > > 2. In mas_wr_spanning_store() we try to find elements to the right of pivot > 0xffff by searching for an index of 0x10000: > > - mas_wr_walk_index() invokes mas_wr_walk_descend() and > mas_wr_node_walk() in turn. > > - mas_wr_node_walk() loops over entries in node A until EITHER it > finds an entry whose pivot equals or exceeds 0x10000 OR it > reaches the final entry. > > - Since no entry has a pivot equal to or exceeding 0x10000, pivot > 0xffff is selected, leading to node C. > > - mas_wr_walk_traverse() resets the mas state to traverse node C. We > loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk() > in turn once again. > > - Again, we reach the last entry in node C, which has a pivot of > 0xffff. > > 3. We then copy the elements to the left of 0x4000 in node B to the big > node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry > too. > > 4. We determine whether we have any entries to copy from the right of the > end of the range via - and with r_mas set up at the entry at pivot > 0xffff, r_mas.offset <= r_mas.end, and then we DUPLICATE the entry at > pivot 0xffff. > > 5. BUG! The maple tree is corrupted with a duplicate entry. > > This requires a very specific set of circumstances - we must be spanning > the last element in a leaf node, which is the last element in the parent > node. > > spanning store across two leaf nodes with a range that ends at that shared > pivot. > > A potential solution to this problem would simply be to reset the walk each > time we traverse r_mas, however given the rarity of this situation it seems > that would be rather inefficient. > > Instead, this patch detects if the right hand node is populated, i.e. has > anything we need to copy. > > We do so by only copying elements from the right of the entry being inserted > when the maximum value present exceeds the last, rather than basing this on > offset position. > > The patch also updates some comments and eliminates the unused bool return > value in mas_wr_walk_index(). > > The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree > in mmap_region()") seems to have made the probability of this event much > more likely, which is the point at which reports started to be submitted > concerning this bug. > > The motivation for this change arose from Bert Karwatzki's report of > encountering mm instability after the release of kernel v6.12-rc1 which, > after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration > options, was identified as maple tree corruption. > > After Bert very generously provided his time and ability to reproduce this > event consistently, I was able to finally identify that the issue discussed > in this commit message was occurring for him. > > Reported-and-tested-by: Bert Karwatzki <spasswolf@web.de> > Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/ > Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> > Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/ > Fixes: 54a611b60590 ("Maple Tree: add new data structure") > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> > --- > lib/maple_tree.c | 12 ++++++------ > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c > index 20990ecba2dd..2fe59c534328 100644 > --- a/lib/maple_tree.c > +++ b/lib/maple_tree.c > @@ -2196,6 +2196,8 @@ static inline void mas_node_or_none(struct ma_state *mas, > > /* > * mas_wr_node_walk() - Find the correct offset for the index in the @mas. > + * If @mas->index cannot be found within the containing > + * node, we traverse to the last entry in the node. > * @wr_mas: The maple write state > * > * Uses mas_slot_locked() and does not need to worry about dead nodes. > @@ -3532,7 +3534,7 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas) > return true; > } > > -static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) > +static void mas_wr_walk_index(struct ma_wr_state *wr_mas) > { > struct ma_state *mas = wr_mas->mas; > > @@ -3541,11 +3543,9 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) > wr_mas->content = mas_slot_locked(mas, wr_mas->slots, > mas->offset); > if (ma_is_leaf(wr_mas->type)) > - return true; > + return; > mas_wr_walk_traverse(wr_mas); > - > } > - return true; > } > /* > * mas_extend_spanning_null() - Extend a store of a %NULL to include surrounding %NULLs. > @@ -3765,8 +3765,8 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) > memset(&b_node, 0, sizeof(struct maple_big_node)); > /* Copy l_mas and store the value in b_node. */ > mas_store_b_node(&l_wr_mas, &b_node, l_mas.end); > - /* Copy r_mas into b_node. */ > - if (r_mas.offset <= r_mas.end) > + /* Copy r_mas into b_node if there is anything to copy. */ > + if (r_mas.max > r_mas.last) > mas_mab_cp(&r_mas, r_mas.offset, r_mas.end, > &b_node, b_node.b_end + 1); > else > -- > 2.46.2
On Mon, Oct 7, 2024 at 8:28 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > There has been a subtle bug present in the maple tree implementation from > its inception. > > This arises from how stores are performed - when a store occurs, it will > overwrite overlapping ranges and adjust the tree as necessary to > accommodate this. > > A range may always ultimately span two leaf nodes. In this instance we walk > the two leaf nodes, determine which elements are not overwritten to the > left and to the right of the start and end of the ranges respectively and > then rebalance the tree to contain these entries and the newly inserted > one. > > This kind of store is dubbed a 'spanning store' and is implemented by > mas_wr_spanning_store(). > > In order to reach this stage, mas_store_gfp() invokes mas_wr_preallocate(), > mas_wr_store_type() and mas_wr_walk() in turn to walk the tree and update > the object (mas) to traverse to the location where the write should be > performed, determining its store type. > > When a spanning store is required, this function returns false stopping at > the parent node which contains the target range, and mas_wr_store_type() > marks the mas->store_type as wr_spanning_store to denote this fact. > > When we go to perform the store in mas_wr_spanning_store(), we first > determine the elements AFTER the END of the range we wish to store (that > is, to the right of the entry to be inserted) - we do this by walking to > the NEXT pivot in the tree (i.e. r_mas.last + 1), starting at the node we > have just determined contains the range over which we intend to write. > > We then turn our attention to the entries to the left of the entry we are > inserting, whose state is represented by l_mas, and copy these into a 'big > node', which is a special node which contains enough slots to contain two > leaf node's worth of data. > > We then copy the entry we wish to store immediately after this - the copy > and the insertion of the new entry is performed by mas_store_b_node(). > > After this we copy the elements to the right of the end of the range which > we are inserting, if we have not exceeded the length of the node > (i.e. r_mas.offset <= r_mas.end). > > Herein lies the bug - under very specific circumstances, this logic can > break and corrupt the maple tree. > > Consider the following tree: > > Height > 0 Root Node > / \ > pivot = 0xffff / \ pivot = ULONG_MAX > / \ > 1 A [-----] ... > / \ > pivot = 0x4fff / \ pivot = 0xffff > / \ > 2 (LEAVES) B [-----] [-----] C > ^--- Last pivot 0xffff. > > Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note > that all ranges expressed in maple tree code are inclusive): > > 1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then > determines that this is a spanning store across nodes B and C. The mas > state is set such that the current node from which we traverse further > is node A. > > 2. In mas_wr_spanning_store() we try to find elements to the right of pivot > 0xffff by searching for an index of 0x10000: > > - mas_wr_walk_index() invokes mas_wr_walk_descend() and > mas_wr_node_walk() in turn. > > - mas_wr_node_walk() loops over entries in node A until EITHER it > finds an entry whose pivot equals or exceeds 0x10000 OR it > reaches the final entry. > > - Since no entry has a pivot equal to or exceeding 0x10000, pivot > 0xffff is selected, leading to node C. > > - mas_wr_walk_traverse() resets the mas state to traverse node C. We > loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk() > in turn once again. > > - Again, we reach the last entry in node C, which has a pivot of > 0xffff. > > 3. We then copy the elements to the left of 0x4000 in node B to the big > node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry > too. > > 4. We determine whether we have any entries to copy from the right of the > end of the range via - and with r_mas set up at the entry at pivot > 0xffff, r_mas.offset <= r_mas.end, and then we DUPLICATE the entry at > pivot 0xffff. > > 5. BUG! The maple tree is corrupted with a duplicate entry. > > This requires a very specific set of circumstances - we must be spanning > the last element in a leaf node, which is the last element in the parent > node. > > spanning store across two leaf nodes with a range that ends at that shared > pivot. > > A potential solution to this problem would simply be to reset the walk each > time we traverse r_mas, however given the rarity of this situation it seems > that would be rather inefficient. > > Instead, this patch detects if the right hand node is populated, i.e. has > anything we need to copy. > > We do so by only copying elements from the right of the entry being inserted > when the maximum value present exceeds the last, rather than basing this on > offset position. > > The patch also updates some comments and eliminates the unused bool return > value in mas_wr_walk_index(). > > The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree > in mmap_region()") seems to have made the probability of this event much > more likely, which is the point at which reports started to be submitted > concerning this bug. > > The motivation for this change arose from Bert Karwatzki's report of > encountering mm instability after the release of kernel v6.12-rc1 which, > after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration > options, was identified as maple tree corruption. > > After Bert very generously provided his time and ability to reproduce this > event consistently, I was able to finally identify that the issue discussed > in this commit message was occurring for him. > > Reported-and-tested-by: Bert Karwatzki <spasswolf@web.de> > Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/ > Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> > Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/ > Fixes: 54a611b60590 ("Maple Tree: add new data structure") > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Acked-by: Vlastimil Babka <vbabka@suse.cz> > --- > lib/maple_tree.c | 12 ++++++------ > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c > index 20990ecba2dd..2fe59c534328 100644 > --- a/lib/maple_tree.c > +++ b/lib/maple_tree.c > @@ -2196,6 +2196,8 @@ static inline void mas_node_or_none(struct ma_state *mas, > > /* > * mas_wr_node_walk() - Find the correct offset for the index in the @mas. > + * If @mas->index cannot be found within the containing > + * node, we traverse to the last entry in the node. > * @wr_mas: The maple write state > * > * Uses mas_slot_locked() and does not need to worry about dead nodes. > @@ -3532,7 +3534,7 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas) > return true; > } > > -static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) > +static void mas_wr_walk_index(struct ma_wr_state *wr_mas) > { > struct ma_state *mas = wr_mas->mas; > > @@ -3541,11 +3543,9 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) > wr_mas->content = mas_slot_locked(mas, wr_mas->slots, > mas->offset); > if (ma_is_leaf(wr_mas->type)) > - return true; > + return; > mas_wr_walk_traverse(wr_mas); > - > } > - return true; > } > /* > * mas_extend_spanning_null() - Extend a store of a %NULL to include surrounding %NULLs. > @@ -3765,8 +3765,8 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) > memset(&b_node, 0, sizeof(struct maple_big_node)); > /* Copy l_mas and store the value in b_node. */ > mas_store_b_node(&l_wr_mas, &b_node, l_mas.end); > - /* Copy r_mas into b_node. */ > - if (r_mas.offset <= r_mas.end) > + /* Copy r_mas into b_node if there is anything to copy. */ > + if (r_mas.max > r_mas.last) > mas_mab_cp(&r_mas, r_mas.offset, r_mas.end, > &b_node, b_node.b_end + 1); > else > -- > 2.46.2 Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
On Mon, Oct 07, 2024 at 04:28:32PM +0100, Lorenzo Stoakes wrote: >There has been a subtle bug present in the maple tree implementation from >its inception. > >This arises from how stores are performed - when a store occurs, it will >overwrite overlapping ranges and adjust the tree as necessary to >accommodate this. > >A range may always ultimately span two leaf nodes. In this instance we walk >the two leaf nodes, determine which elements are not overwritten to the >left and to the right of the start and end of the ranges respectively and >then rebalance the tree to contain these entries and the newly inserted >one. > >This kind of store is dubbed a 'spanning store' and is implemented by >mas_wr_spanning_store(). > >In order to reach this stage, mas_store_gfp() invokes mas_wr_preallocate(), >mas_wr_store_type() and mas_wr_walk() in turn to walk the tree and update >the object (mas) to traverse to the location where the write should be >performed, determining its store type. > >When a spanning store is required, this function returns false stopping at >the parent node which contains the target range, and mas_wr_store_type() >marks the mas->store_type as wr_spanning_store to denote this fact. > >When we go to perform the store in mas_wr_spanning_store(), we first >determine the elements AFTER the END of the range we wish to store (that >is, to the right of the entry to be inserted) - we do this by walking to >the NEXT pivot in the tree (i.e. r_mas.last + 1), starting at the node we >have just determined contains the range over which we intend to write. > >We then turn our attention to the entries to the left of the entry we are >inserting, whose state is represented by l_mas, and copy these into a 'big >node', which is a special node which contains enough slots to contain two >leaf node's worth of data. > >We then copy the entry we wish to store immediately after this - the copy >and the insertion of the new entry is performed by mas_store_b_node(). > >After this we copy the elements to the right of the end of the range which >we are inserting, if we have not exceeded the length of the node >(i.e. r_mas.offset <= r_mas.end). > >Herein lies the bug - under very specific circumstances, this logic can >break and corrupt the maple tree. > >Consider the following tree: > >Height > 0 Root Node > / \ > pivot = 0xffff / \ pivot = ULONG_MAX > / \ > 1 A [-----] ... > / \ > pivot = 0x4fff / \ pivot = 0xffff > / \ > 2 (LEAVES) B [-----] [-----] C > ^--- Last pivot 0xffff. > >Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note >that all ranges expressed in maple tree code are inclusive): > >1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then > determines that this is a spanning store across nodes B and C. The mas > state is set such that the current node from which we traverse further > is node A. > >2. In mas_wr_spanning_store() we try to find elements to the right of pivot > 0xffff by searching for an index of 0x10000: > > - mas_wr_walk_index() invokes mas_wr_walk_descend() and > mas_wr_node_walk() in turn. > > - mas_wr_node_walk() loops over entries in node A until EITHER it > finds an entry whose pivot equals or exceeds 0x10000 OR it > reaches the final entry. > > - Since no entry has a pivot equal to or exceeding 0x10000, pivot > 0xffff is selected, leading to node C. > > - mas_wr_walk_traverse() resets the mas state to traverse node C. We > loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk() > in turn once again. > > - Again, we reach the last entry in node C, which has a pivot of > 0xffff. > >3. We then copy the elements to the left of 0x4000 in node B to the big > node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry > too. > >4. We determine whether we have any entries to copy from the right of the > end of the range via - and with r_mas set up at the entry at pivot > 0xffff, r_mas.offset <= r_mas.end, and then we DUPLICATE the entry at > pivot 0xffff. > >5. BUG! The maple tree is corrupted with a duplicate entry. > >This requires a very specific set of circumstances - we must be spanning >the last element in a leaf node, which is the last element in the parent >node. > >spanning store across two leaf nodes with a range that ends at that shared >pivot. > >A potential solution to this problem would simply be to reset the walk each >time we traverse r_mas, however given the rarity of this situation it seems >that would be rather inefficient. > >Instead, this patch detects if the right hand node is populated, i.e. has >anything we need to copy. > >We do so by only copying elements from the right of the entry being inserted >when the maximum value present exceeds the last, rather than basing this on >offset position. > >The patch also updates some comments and eliminates the unused bool return >value in mas_wr_walk_index(). > >The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree >in mmap_region()") seems to have made the probability of this event much >more likely, which is the point at which reports started to be submitted >concerning this bug. > >The motivation for this change arose from Bert Karwatzki's report of >encountering mm instability after the release of kernel v6.12-rc1 which, >after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration >options, was identified as maple tree corruption. > >After Bert very generously provided his time and ability to reproduce this >event consistently, I was able to finally identify that the issue discussed >in this commit message was occurring for him. > >Reported-and-tested-by: Bert Karwatzki <spasswolf@web.de> >Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/ >Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> >Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/ >Fixes: 54a611b60590 ("Maple Tree: add new data structure") >Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 20990ecba2dd..2fe59c534328 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -2196,6 +2196,8 @@ static inline void mas_node_or_none(struct ma_state *mas, /* * mas_wr_node_walk() - Find the correct offset for the index in the @mas. + * If @mas->index cannot be found within the containing + * node, we traverse to the last entry in the node. * @wr_mas: The maple write state * * Uses mas_slot_locked() and does not need to worry about dead nodes. @@ -3532,7 +3534,7 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas) return true; } -static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) +static void mas_wr_walk_index(struct ma_wr_state *wr_mas) { struct ma_state *mas = wr_mas->mas; @@ -3541,11 +3543,9 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) wr_mas->content = mas_slot_locked(mas, wr_mas->slots, mas->offset); if (ma_is_leaf(wr_mas->type)) - return true; + return; mas_wr_walk_traverse(wr_mas); - } - return true; } /* * mas_extend_spanning_null() - Extend a store of a %NULL to include surrounding %NULLs. @@ -3765,8 +3765,8 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) memset(&b_node, 0, sizeof(struct maple_big_node)); /* Copy l_mas and store the value in b_node. */ mas_store_b_node(&l_wr_mas, &b_node, l_mas.end); - /* Copy r_mas into b_node. */ - if (r_mas.offset <= r_mas.end) + /* Copy r_mas into b_node if there is anything to copy. */ + if (r_mas.max > r_mas.last) mas_mab_cp(&r_mas, r_mas.offset, r_mas.end, &b_node, b_node.b_end + 1); else