diff mbox series

[3/4] maple_tree: use the correct min to calculate split

Message ID 20241020024628.22469-4-richard.weiyang@gmail.com (mailing list archive)
State New
Headers show
Series maple_tree: current split may result in deficient node | expand

Commit Message

Wei Yang Oct. 20, 2024, 2:46 a.m. UTC
The check in mab_calc_split() is to make sure the gap between
[0, split] won't be too small. But we don't pass the correct min.

First we may encounter a pivot[split] smaller than min. For example:

       mt_init_flags(mt, 0);
       for (count = 0; count <= 240; count++) {
               mas_set(&mas, count);
               mas_store(&mas, xa_mk_value(count));
       }

On splitting root for storing 240, the pivot[split] is smaller than min.
This result a huge (pivot[split] - min).

Second prev_l_mas.min is not initialized for the first iteration. This
means on splitting leaf node, this value is mostly taking no effect.

Since we are now calculating the split of mas->node, we should use the
mas->min instead of prev_l_mas.min.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: Liam R. Howlett <Liam.Howlett@Oracle.com>
CC: Sidhartha Kumar <sidhartha.kumar@oracle.com>
CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 lib/maple_tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Liam R. Howlett Oct. 20, 2024, 10 p.m. UTC | #1
* Wei Yang <richard.weiyang@gmail.com> [241019 22:46]:
> The check in mab_calc_split() is to make sure the gap between
> [0, split] won't be too small. But we don't pass the correct min.
> 
> First we may encounter a pivot[split] smaller than min. For example:
> 
>        mt_init_flags(mt, 0);
>        for (count = 0; count <= 240; count++) {
>                mas_set(&mas, count);
>                mas_store(&mas, xa_mk_value(count));
>        }
> 
> On splitting root for storing 240, the pivot[split] is smaller than min.
> This result a huge (pivot[split] - min).

This is fine.

There is an open work item to make it more accurate at higher levels,
but it's not a problem as it is.

Each level upwards needs a better 'minimum span', meaning that the node
should have at least mas.min - mas.min + level * something.

It works today for leaves, somewhat.

> 
> Second prev_l_mas.min is not initialized for the first iteration. This
> means on splitting leaf node, this value is mostly taking no effect.

No, it is set to 0.  Not initialized would cause random data loss.

See MA_STATE() in the header.

> 
> Since we are now calculating the split of mas->node, we should use the
> mas->min instead of prev_l_mas.min.

This sounds reasonable, but considering what this number is used for, I
don't see how it is working as well as it is today.  I will need to look
deeper into this.

> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> CC: Liam R. Howlett <Liam.Howlett@Oracle.com>
> CC: Sidhartha Kumar <sidhartha.kumar@oracle.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  lib/maple_tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 894dc5e9436e..c2d4b188646c 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -3357,7 +3357,7 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node)
>  		if (mas_push_data(mas, height, &mast, false))
>  			break;
>  
> -		split = mab_calc_split(mas, b_node, &mid_split, prev_l_mas.min);
> +		split = mab_calc_split(mas, b_node, &mid_split, mas->min);
>  		mast_split_data(&mast, mas, split);
>  		/*
>  		 * Usually correct, mab_mas_cp in the above call overwrites
> -- 
> 2.34.1
>
Liam R. Howlett Oct. 29, 2024, 5:49 p.m. UTC | #2
* Liam R. Howlett <Liam.Howlett@oracle.com> [241020 18:00]:
> * Wei Yang <richard.weiyang@gmail.com> [241019 22:46]:
> > The check in mab_calc_split() is to make sure the gap between
> > [0, split] won't be too small. But we don't pass the correct min.
> > 
> > First we may encounter a pivot[split] smaller than min. For example:
> > 
> >        mt_init_flags(mt, 0);
> >        for (count = 0; count <= 240; count++) {
> >                mas_set(&mas, count);
> >                mas_store(&mas, xa_mk_value(count));
> >        }
> > 
> > On splitting root for storing 240, the pivot[split] is smaller than min.
> > This result a huge (pivot[split] - min).
> 
> This is fine.
> 
> There is an open work item to make it more accurate at higher levels,
> but it's not a problem as it is.
> 
> Each level upwards needs a better 'minimum span', meaning that the node
> should have at least mas.min - mas.min + level * something.
> 
> It works today for leaves, somewhat.
> 
> > 
> > Second prev_l_mas.min is not initialized for the first iteration. This
> > means on splitting leaf node, this value is mostly taking no effect.
> 
> No, it is set to 0.  Not initialized would cause random data loss.
> 
> See MA_STATE() in the header.
> 
> > 
> > Since we are now calculating the split of mas->node, we should use the
> > mas->min instead of prev_l_mas.min.
> 
> This sounds reasonable, but considering what this number is used for, I
> don't see how it is working as well as it is today.  I will need to look
> deeper into this.


On examination of what is happening here, the only way this makes a
difference during the testcases is if we have a node with 16 entries,
we'll put 8 in the left today and 9 in the left after this change.

This only matters if the range is less than the slot count, so the real
world implications of the change will be negligible, if anything.

I honestly think I'm trying to be too smart here, especially at the
leaves.  We should just set mid_split = 0; in that complex statement and
drop the min argument all together.  It hasn't made a difference besides
the number of instructions executed.

> 
> > 
> > Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> > CC: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > CC: Sidhartha Kumar <sidhartha.kumar@oracle.com>
> > CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  lib/maple_tree.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > index 894dc5e9436e..c2d4b188646c 100644
> > --- a/lib/maple_tree.c
> > +++ b/lib/maple_tree.c
> > @@ -3357,7 +3357,7 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node)
> >  		if (mas_push_data(mas, height, &mast, false))
> >  			break;
> >  
> > -		split = mab_calc_split(mas, b_node, &mid_split, prev_l_mas.min);
> > +		split = mab_calc_split(mas, b_node, &mid_split, mas->min);
> >  		mast_split_data(&mast, mas, split);
> >  		/*
> >  		 * Usually correct, mab_mas_cp in the above call overwrites
> > -- 
> > 2.34.1
> >
Wei Yang Nov. 8, 2024, 2:55 a.m. UTC | #3
On Sun, Oct 20, 2024 at 06:00:35PM -0400, Liam R. Howlett wrote:
>* Wei Yang <richard.weiyang@gmail.com> [241019 22:46]:
>> The check in mab_calc_split() is to make sure the gap between
>> [0, split] won't be too small. But we don't pass the correct min.
>> 
>> First we may encounter a pivot[split] smaller than min. For example:
>> 
>>        mt_init_flags(mt, 0);
>>        for (count = 0; count <= 240; count++) {
>>                mas_set(&mas, count);
>>                mas_store(&mas, xa_mk_value(count));
>>        }
>> 
>> On splitting root for storing 240, the pivot[split] is smaller than min.
>> This result a huge (pivot[split] - min).
>
>This is fine.
>
>There is an open work item to make it more accurate at higher levels,
>but it's not a problem as it is.
>
>Each level upwards needs a better 'minimum span', meaning that the node
>should have at least mas.min - mas.min + level * something.

Don't follow this. 

First, I guess it is mas.max - mas.min?

Then there is sth related to level. But I don't see level is used for
calculation. Would you mind sharing more your original thought?

>
>It works today for leaves, somewhat.
>

To me it works by coincidence.

The condition check is:

  (bn->pivot[split] - min) < (slot_count - 1) 

while slot_count is a constent, 16 for leaf node.

And we always pass 0 as min, the condition for leaf is:

  bn->pivot[split] < 15

For the mmap world, per my understanding, we won't expect an address is smaller
than 15.

>> 
>> Second prev_l_mas.min is not initialized for the first iteration. This
>> means on splitting leaf node, this value is mostly taking no effect.
>
>No, it is set to 0.  Not initialized would cause random data loss.
>
>See MA_STATE() in the header.
>

Right, I want to say not initialized to the value we want.

Will be more careful next time.

>> 
>> Since we are now calculating the split of mas->node, we should use the
>> mas->min instead of prev_l_mas.min.
>
>This sounds reasonable, but considering what this number is used for, I
>don't see how it is working as well as it is today.  I will need to look
>deeper into this.
>
>> 
>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> CC: Liam R. Howlett <Liam.Howlett@Oracle.com>
>> CC: Sidhartha Kumar <sidhartha.kumar@oracle.com>
>> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> ---
>>  lib/maple_tree.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index 894dc5e9436e..c2d4b188646c 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -3357,7 +3357,7 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node)
>>  		if (mas_push_data(mas, height, &mast, false))
>>  			break;
>>  
>> -		split = mab_calc_split(mas, b_node, &mid_split, prev_l_mas.min);
>> +		split = mab_calc_split(mas, b_node, &mid_split, mas->min);
>>  		mast_split_data(&mast, mas, split);
>>  		/*
>>  		 * Usually correct, mab_mas_cp in the above call overwrites
>> -- 
>> 2.34.1
>>
Wei Yang Nov. 8, 2024, 3:02 a.m. UTC | #4
On Tue, Oct 29, 2024 at 01:49:28PM -0400, Liam R. Howlett wrote:
>* Liam R. Howlett <Liam.Howlett@oracle.com> [241020 18:00]:
>> * Wei Yang <richard.weiyang@gmail.com> [241019 22:46]:
>> > The check in mab_calc_split() is to make sure the gap between
>> > [0, split] won't be too small. But we don't pass the correct min.
>> > 
>> > First we may encounter a pivot[split] smaller than min. For example:
>> > 
>> >        mt_init_flags(mt, 0);
>> >        for (count = 0; count <= 240; count++) {
>> >                mas_set(&mas, count);
>> >                mas_store(&mas, xa_mk_value(count));
>> >        }
>> > 
>> > On splitting root for storing 240, the pivot[split] is smaller than min.
>> > This result a huge (pivot[split] - min).
>> 
>> This is fine.
>> 
>> There is an open work item to make it more accurate at higher levels,
>> but it's not a problem as it is.
>> 
>> Each level upwards needs a better 'minimum span', meaning that the node
>> should have at least mas.min - mas.min + level * something.
>> 
>> It works today for leaves, somewhat.
>> 
>> > 
>> > Second prev_l_mas.min is not initialized for the first iteration. This
>> > means on splitting leaf node, this value is mostly taking no effect.
>> 
>> No, it is set to 0.  Not initialized would cause random data loss.
>> 
>> See MA_STATE() in the header.
>> 
>> > 
>> > Since we are now calculating the split of mas->node, we should use the
>> > mas->min instead of prev_l_mas.min.
>> 
>> This sounds reasonable, but considering what this number is used for, I
>> don't see how it is working as well as it is today.  I will need to look
>> deeper into this.
>
>
>On examination of what is happening here, the only way this makes a
>difference during the testcases is if we have a node with 16 entries,
>we'll put 8 in the left today and 9 in the left after this change.
>
>This only matters if the range is less than the slot count, so the real
>world implications of the change will be negligible, if anything.
>
>I honestly think I'm trying to be too smart here, especially at the
>leaves.  We should just set mid_split = 0; in that complex statement and
>drop the min argument all together.  It hasn't made a difference besides
>the number of instructions executed.
>

This looks good to me. And it looks we won't face jitter problem after this.

I need to take a through look.

>> 
>> > 
>> > Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> > CC: Liam R. Howlett <Liam.Howlett@Oracle.com>
>> > CC: Sidhartha Kumar <sidhartha.kumar@oracle.com>
>> > CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> > ---
>> >  lib/maple_tree.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> > index 894dc5e9436e..c2d4b188646c 100644
>> > --- a/lib/maple_tree.c
>> > +++ b/lib/maple_tree.c
>> > @@ -3357,7 +3357,7 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node)
>> >  		if (mas_push_data(mas, height, &mast, false))
>> >  			break;
>> >  
>> > -		split = mab_calc_split(mas, b_node, &mid_split, prev_l_mas.min);
>> > +		split = mab_calc_split(mas, b_node, &mid_split, mas->min);
>> >  		mast_split_data(&mast, mas, split);
>> >  		/*
>> >  		 * Usually correct, mab_mas_cp in the above call overwrites
>> > -- 
>> > 2.34.1
>> >
Liam R. Howlett Nov. 8, 2024, 3:19 a.m. UTC | #5
* Wei Yang <richard.weiyang@gmail.com> [241107 21:55]:
> On Sun, Oct 20, 2024 at 06:00:35PM -0400, Liam R. Howlett wrote:
> >* Wei Yang <richard.weiyang@gmail.com> [241019 22:46]:
> >> The check in mab_calc_split() is to make sure the gap between
> >> [0, split] won't be too small. But we don't pass the correct min.
> >> 
> >> First we may encounter a pivot[split] smaller than min. For example:
> >> 
> >>        mt_init_flags(mt, 0);
> >>        for (count = 0; count <= 240; count++) {
> >>                mas_set(&mas, count);
> >>                mas_store(&mas, xa_mk_value(count));
> >>        }
> >> 
> >> On splitting root for storing 240, the pivot[split] is smaller than min.
> >> This result a huge (pivot[split] - min).
> >
> >This is fine.
> >
> >There is an open work item to make it more accurate at higher levels,
> >but it's not a problem as it is.
> >
> >Each level upwards needs a better 'minimum span', meaning that the node
> >should have at least mas.min - mas.min + level * something.
> 
> Don't follow this. 
> 
> First, I guess it is mas.max - mas.min?
> 

No, the idea is that we could do better if we could keep the data more
densely packed.

If we have singletons (all range of 1), and we have 10,000 singletons,
then a parent node with 0 - 128 and a sibling of 129-257 makes no sense,
there is nothing that can be inserting into the parents last two slots
- we can't split singletons smaller so we should have done a better job
at splitting.

If you think of how this grows today, we split then rebalance until the
left node is full and we have to split again.  This could be better
optimised to avoid multiple rebalances by doing something to push more
data to the left based on the span of the children and the nodes.  But
we'd want to keep enough so some modifications to each node doesn't
cause a rebalance in the opposite direction.

So a minimum span of parent nodes one level up from the leaves would be
16 * 10 = 160, so when we split we can push more to the left to try and
get to 160 while keeping the right node sufficient.  The math of the
span would change based on how far up the tree we go, but would be
easier to maintain.

I'd probably still want to split it 50/50 the first time, but it might
make sense to push more on the next rebalance.  so 10 => 6|5, 6|10 =>
9|5, but maybe it's not worth the effort.  I'm starting to think so,
considering it's been ineffective to date.  There is a chance that the
active area gets moved to the more-full node and ends up splitting
sooner anyways..

All of this will probably change when dense nodes come anyways.

> Then there is sth related to level. But I don't see level is used for
> calculation. Would you mind sharing more your original thought?
> 
> >
> >It works today for leaves, somewhat.
> >
> 
> To me it works by coincidence.

Seems like it.

> 
> The condition check is:
> 
>   (bn->pivot[split] - min) < (slot_count - 1) 
> 
> while slot_count is a constent, 16 for leaf node.
> 
> And we always pass 0 as min, the condition for leaf is:
> 
>   bn->pivot[split] < 15
> 
> For the mmap world, per my understanding, we won't expect an address is smaller
> than 15.

There are other users, but I think we should just drop this attempt at
fancy stuff.

> 
> >> 
> >> Second prev_l_mas.min is not initialized for the first iteration. This
> >> means on splitting leaf node, this value is mostly taking no effect.
> >
> >No, it is set to 0.  Not initialized would cause random data loss.
> >
> >See MA_STATE() in the header.
> >
> 
> Right, I want to say not initialized to the value we want.
> 
> Will be more careful next time.
> 
> >> 
> >> Since we are now calculating the split of mas->node, we should use the
> >> mas->min instead of prev_l_mas.min.
> >
> >This sounds reasonable, but considering what this number is used for, I
> >don't see how it is working as well as it is today.  I will need to look
> >deeper into this.
> >
> >> 
> >> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> >> CC: Liam R. Howlett <Liam.Howlett@Oracle.com>
> >> CC: Sidhartha Kumar <sidhartha.kumar@oracle.com>
> >> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >> ---
> >>  lib/maple_tree.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >> 
> >> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> >> index 894dc5e9436e..c2d4b188646c 100644
> >> --- a/lib/maple_tree.c
> >> +++ b/lib/maple_tree.c
> >> @@ -3357,7 +3357,7 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node)
> >>  		if (mas_push_data(mas, height, &mast, false))
> >>  			break;
> >>  
> >> -		split = mab_calc_split(mas, b_node, &mid_split, prev_l_mas.min);
> >> +		split = mab_calc_split(mas, b_node, &mid_split, mas->min);
> >>  		mast_split_data(&mast, mas, split);
> >>  		/*
> >>  		 * Usually correct, mab_mas_cp in the above call overwrites
> >> -- 
> >> 2.34.1
> >> 
> 
> -- 
> Wei Yang
> Help you, Help me
Wei Yang Nov. 9, 2024, 1:24 a.m. UTC | #6
On Tue, Oct 29, 2024 at 01:49:28PM -0400, Liam R. Howlett wrote:
>* Liam R. Howlett <Liam.Howlett@oracle.com> [241020 18:00]:
>> * Wei Yang <richard.weiyang@gmail.com> [241019 22:46]:
>> > The check in mab_calc_split() is to make sure the gap between
>> > [0, split] won't be too small. But we don't pass the correct min.
>> > 
>> > First we may encounter a pivot[split] smaller than min. For example:
>> > 
>> >        mt_init_flags(mt, 0);
>> >        for (count = 0; count <= 240; count++) {
>> >                mas_set(&mas, count);
>> >                mas_store(&mas, xa_mk_value(count));
>> >        }
>> > 
>> > On splitting root for storing 240, the pivot[split] is smaller than min.
>> > This result a huge (pivot[split] - min).
>> 
>> This is fine.
>> 
>> There is an open work item to make it more accurate at higher levels,
>> but it's not a problem as it is.
>> 
>> Each level upwards needs a better 'minimum span', meaning that the node
>> should have at least mas.min - mas.min + level * something.
>> 
>> It works today for leaves, somewhat.
>> 
>> > 
>> > Second prev_l_mas.min is not initialized for the first iteration. This
>> > means on splitting leaf node, this value is mostly taking no effect.
>> 
>> No, it is set to 0.  Not initialized would cause random data loss.
>> 
>> See MA_STATE() in the header.
>> 
>> > 
>> > Since we are now calculating the split of mas->node, we should use the
>> > mas->min instead of prev_l_mas.min.
>> 
>> This sounds reasonable, but considering what this number is used for, I
>> don't see how it is working as well as it is today.  I will need to look
>> deeper into this.
>
>
>On examination of what is happening here, the only way this makes a
>difference during the testcases is if we have a node with 16 entries,
>we'll put 8 in the left today and 9 in the left after this change.
>
>This only matters if the range is less than the slot count, so the real
>world implications of the change will be negligible, if anything.
>

Yes, maybe it only effect when there are all singleton.

>I honestly think I'm trying to be too smart here, especially at the
>leaves.  We should just set mid_split = 0; in that complex statement and
>drop the min argument all together.  It hasn't made a difference besides
>the number of instructions executed.
>

I have tried to remove those lines, so we got split = b_end / 2 and 
mid_split = 0. Tests all passed and kernel runs. (I guess in the real world,
it behaves the same, since (bn->pivot[split] - min) is always bigger than 15.)

Since we call mab_calc_split() when b_end >= slot_count == 16, it means we
get split == 16/2 == 8. So the new left|right node has data 9|8, this is
friendly to jitter problem. 

>> 
>> > 
>> > Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> > CC: Liam R. Howlett <Liam.Howlett@Oracle.com>
>> > CC: Sidhartha Kumar <sidhartha.kumar@oracle.com>
>> > CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> > ---
>> >  lib/maple_tree.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> > index 894dc5e9436e..c2d4b188646c 100644
>> > --- a/lib/maple_tree.c
>> > +++ b/lib/maple_tree.c
>> > @@ -3357,7 +3357,7 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node)
>> >  		if (mas_push_data(mas, height, &mast, false))
>> >  			break;
>> >  
>> > -		split = mab_calc_split(mas, b_node, &mid_split, prev_l_mas.min);
>> > +		split = mab_calc_split(mas, b_node, &mid_split, mas->min);
>> >  		mast_split_data(&mast, mas, split);
>> >  		/*
>> >  		 * Usually correct, mab_mas_cp in the above call overwrites
>> > -- 
>> > 2.34.1
>> >
Wei Yang Nov. 9, 2024, 1:40 a.m. UTC | #7
On Thu, Nov 07, 2024 at 10:19:37PM -0500, Liam R. Howlett wrote:
>* Wei Yang <richard.weiyang@gmail.com> [241107 21:55]:
>> On Sun, Oct 20, 2024 at 06:00:35PM -0400, Liam R. Howlett wrote:
>> >* Wei Yang <richard.weiyang@gmail.com> [241019 22:46]:
>> >> The check in mab_calc_split() is to make sure the gap between
>> >> [0, split] won't be too small. But we don't pass the correct min.
>> >> 
>> >> First we may encounter a pivot[split] smaller than min. For example:
>> >> 
>> >>        mt_init_flags(mt, 0);
>> >>        for (count = 0; count <= 240; count++) {
>> >>                mas_set(&mas, count);
>> >>                mas_store(&mas, xa_mk_value(count));
>> >>        }
>> >> 
>> >> On splitting root for storing 240, the pivot[split] is smaller than min.
>> >> This result a huge (pivot[split] - min).
>> >
>> >This is fine.
>> >
>> >There is an open work item to make it more accurate at higher levels,
>> >but it's not a problem as it is.
>> >
>> >Each level upwards needs a better 'minimum span', meaning that the node
>> >should have at least mas.min - mas.min + level * something.
>> 
>> Don't follow this. 
>> 
>> First, I guess it is mas.max - mas.min?
>> 
>
>No, the idea is that we could do better if we could keep the data more
>densely packed.
>
>If we have singletons (all range of 1), and we have 10,000 singletons,
>then a parent node with 0 - 128 and a sibling of 129-257 makes no sense,
>there is nothing that can be inserting into the parents last two slots
>- we can't split singletons smaller so we should have done a better job
>at splitting.
>
>If you think of how this grows today, we split then rebalance until the
>left node is full and we have to split again.  This could be better
>optimised to avoid multiple rebalances by doing something to push more
>data to the left based on the span of the children and the nodes.  But
>we'd want to keep enough so some modifications to each node doesn't
>cause a rebalance in the opposite direction.
>
>So a minimum span of parent nodes one level up from the leaves would be
>16 * 10 = 160, so when we split we can push more to the left to try and
>get to 160 while keeping the right node sufficient.  The math of the
>span would change based on how far up the tree we go, but would be
>easier to maintain.
>
>I'd probably still want to split it 50/50 the first time, but it might
>make sense to push more on the next rebalance.  so 10 => 6|5, 6|10 =>
>9|5, but maybe it's not worth the effort.  I'm starting to think so,
>considering it's been ineffective to date.  There is a chance that the
>active area gets moved to the more-full node and ends up splitting
>sooner anyways..
>
>All of this will probably change when dense nodes come anyways.
>
>> Then there is sth related to level. But I don't see level is used for
>> calculation. Would you mind sharing more your original thought?
>> 
>> >
>> >It works today for leaves, somewhat.
>> >
>> 
>> To me it works by coincidence.
>
>Seems like it.
>
>> 
>> The condition check is:
>> 
>>   (bn->pivot[split] - min) < (slot_count - 1) 
>> 
>> while slot_count is a constent, 16 for leaf node.
>> 
>> And we always pass 0 as min, the condition for leaf is:
>> 
>>   bn->pivot[split] < 15
>> 
>> For the mmap world, per my understanding, we won't expect an address is smaller
>> than 15.
>
>There are other users, but I think we should just drop this attempt at
>fancy stuff.
>

You mean "set mid_split = 0 and drop the min argument all together" mentioned
in another mail?

>> 
>> >> 
>> >> Second prev_l_mas.min is not initialized for the first iteration. This
>> >> means on splitting leaf node, this value is mostly taking no effect.
>> >
>> >No, it is set to 0.  Not initialized would cause random data loss.
>> >
>> >See MA_STATE() in the header.
>> >
>> 
>> Right, I want to say not initialized to the value we want.
>> 
>> Will be more careful next time.
>> 
>> >> 
>> >> Since we are now calculating the split of mas->node, we should use the
>> >> mas->min instead of prev_l_mas.min.
>> >
>> >This sounds reasonable, but considering what this number is used for, I
>> >don't see how it is working as well as it is today.  I will need to look
>> >deeper into this.
>> >
>> >> 
>> >> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> >> CC: Liam R. Howlett <Liam.Howlett@Oracle.com>
>> >> CC: Sidhartha Kumar <sidhartha.kumar@oracle.com>
>> >> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> >> ---
>> >>  lib/maple_tree.c | 2 +-
>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >> 
>> >> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> >> index 894dc5e9436e..c2d4b188646c 100644
>> >> --- a/lib/maple_tree.c
>> >> +++ b/lib/maple_tree.c
>> >> @@ -3357,7 +3357,7 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node)
>> >>  		if (mas_push_data(mas, height, &mast, false))
>> >>  			break;
>> >>  
>> >> -		split = mab_calc_split(mas, b_node, &mid_split, prev_l_mas.min);
>> >> +		split = mab_calc_split(mas, b_node, &mid_split, mas->min);
>> >>  		mast_split_data(&mast, mas, split);
>> >>  		/*
>> >>  		 * Usually correct, mab_mas_cp in the above call overwrites
>> >> -- 
>> >> 2.34.1
>> >> 
>> 
>> -- 
>> Wei Yang
>> Help you, Help me
Liam R. Howlett Nov. 9, 2024, 4:01 a.m. UTC | #8
* Wei Yang <richard.weiyang@gmail.com> [241108 20:40]:
> On Thu, Nov 07, 2024 at 10:19:37PM -0500, Liam R. Howlett wrote:
> >* Wei Yang <richard.weiyang@gmail.com> [241107 21:55]:
> >> On Sun, Oct 20, 2024 at 06:00:35PM -0400, Liam R. Howlett wrote:
> >> >* Wei Yang <richard.weiyang@gmail.com> [241019 22:46]:
> >> >> The check in mab_calc_split() is to make sure the gap between
> >> >> [0, split] won't be too small. But we don't pass the correct min.
> >> >> 
> >> >> First we may encounter a pivot[split] smaller than min. For example:
> >> >> 
> >> >>        mt_init_flags(mt, 0);
> >> >>        for (count = 0; count <= 240; count++) {
> >> >>                mas_set(&mas, count);
> >> >>                mas_store(&mas, xa_mk_value(count));
> >> >>        }
> >> >> 
> >> >> On splitting root for storing 240, the pivot[split] is smaller than min.
> >> >> This result a huge (pivot[split] - min).
> >> >
> >> >This is fine.
> >> >
> >> >There is an open work item to make it more accurate at higher levels,
> >> >but it's not a problem as it is.
> >> >
> >> >Each level upwards needs a better 'minimum span', meaning that the node
> >> >should have at least mas.min - mas.min + level * something.
> >> 
> >> Don't follow this. 
> >> 
> >> First, I guess it is mas.max - mas.min?
> >> 
> >
> >No, the idea is that we could do better if we could keep the data more
> >densely packed.
> >
> >If we have singletons (all range of 1), and we have 10,000 singletons,
> >then a parent node with 0 - 128 and a sibling of 129-257 makes no sense,
> >there is nothing that can be inserting into the parents last two slots
> >- we can't split singletons smaller so we should have done a better job
> >at splitting.
> >
> >If you think of how this grows today, we split then rebalance until the
> >left node is full and we have to split again.  This could be better
> >optimised to avoid multiple rebalances by doing something to push more
> >data to the left based on the span of the children and the nodes.  But
> >we'd want to keep enough so some modifications to each node doesn't
> >cause a rebalance in the opposite direction.
> >
> >So a minimum span of parent nodes one level up from the leaves would be
> >16 * 10 = 160, so when we split we can push more to the left to try and
> >get to 160 while keeping the right node sufficient.  The math of the
> >span would change based on how far up the tree we go, but would be
> >easier to maintain.
> >
> >I'd probably still want to split it 50/50 the first time, but it might
> >make sense to push more on the next rebalance.  so 10 => 6|5, 6|10 =>
> >9|5, but maybe it's not worth the effort.  I'm starting to think so,
> >considering it's been ineffective to date.  There is a chance that the
> >active area gets moved to the more-full node and ends up splitting
> >sooner anyways..
> >
> >All of this will probably change when dense nodes come anyways.
> >
> >> Then there is sth related to level. But I don't see level is used for
> >> calculation. Would you mind sharing more your original thought?
> >> 
> >> >
> >> >It works today for leaves, somewhat.
> >> >
> >> 
> >> To me it works by coincidence.
> >
> >Seems like it.
> >
> >> 
> >> The condition check is:
> >> 
> >>   (bn->pivot[split] - min) < (slot_count - 1) 
> >> 
> >> while slot_count is a constent, 16 for leaf node.
> >> 
> >> And we always pass 0 as min, the condition for leaf is:
> >> 
> >>   bn->pivot[split] < 15
> >> 
> >> For the mmap world, per my understanding, we won't expect an address is smaller
> >> than 15.
> >
> >There are other users, but I think we should just drop this attempt at
> >fancy stuff.
> >
> 
> You mean "set mid_split = 0 and drop the min argument all together" mentioned
> in another mail?

Yes, please.  The more I think about this, the more I think we're
wasting time trying to predict the future.  And, although fun, it
usually doesn't work out.

> 
> >> 
> >> >> 
> >> >> Second prev_l_mas.min is not initialized for the first iteration. This
> >> >> means on splitting leaf node, this value is mostly taking no effect.
> >> >
> >> >No, it is set to 0.  Not initialized would cause random data loss.
> >> >
> >> >See MA_STATE() in the header.
> >> >
> >> 
> >> Right, I want to say not initialized to the value we want.
> >> 
> >> Will be more careful next time.
> >> 
> >> >> 
> >> >> Since we are now calculating the split of mas->node, we should use the
> >> >> mas->min instead of prev_l_mas.min.
> >> >
> >> >This sounds reasonable, but considering what this number is used for, I
> >> >don't see how it is working as well as it is today.  I will need to look
> >> >deeper into this.
> >> >
> >> >> 
> >> >> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> >> >> CC: Liam R. Howlett <Liam.Howlett@Oracle.com>
> >> >> CC: Sidhartha Kumar <sidhartha.kumar@oracle.com>
> >> >> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >> >> ---
> >> >>  lib/maple_tree.c | 2 +-
> >> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >> >> 
> >> >> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> >> >> index 894dc5e9436e..c2d4b188646c 100644
> >> >> --- a/lib/maple_tree.c
> >> >> +++ b/lib/maple_tree.c
> >> >> @@ -3357,7 +3357,7 @@ static void mas_split(struct ma_state *mas, struct maple_big_node *b_node)
> >> >>  		if (mas_push_data(mas, height, &mast, false))
> >> >>  			break;
> >> >>  
> >> >> -		split = mab_calc_split(mas, b_node, &mid_split, prev_l_mas.min);
> >> >> +		split = mab_calc_split(mas, b_node, &mid_split, mas->min);
> >> >>  		mast_split_data(&mast, mas, split);
> >> >>  		/*
> >> >>  		 * Usually correct, mab_mas_cp in the above call overwrites
> >> >> -- 
> >> >> 2.34.1
> >> >> 
> >> 
> >> -- 
> >> Wei Yang
> >> Help you, Help me
> 
> -- 
> Wei Yang
> Help you, Help me
Wei Yang Nov. 9, 2024, 12:22 p.m. UTC | #9
On Fri, Nov 08, 2024 at 11:01:21PM -0500, Liam R. Howlett wrote:
>* Wei Yang <richard.weiyang@gmail.com> [241108 20:40]:
>> On Thu, Nov 07, 2024 at 10:19:37PM -0500, Liam R. Howlett wrote:
>> >* Wei Yang <richard.weiyang@gmail.com> [241107 21:55]:
>> >> On Sun, Oct 20, 2024 at 06:00:35PM -0400, Liam R. Howlett wrote:
>> >> >* Wei Yang <richard.weiyang@gmail.com> [241019 22:46]:
>> >> >> The check in mab_calc_split() is to make sure the gap between
>> >> >> [0, split] won't be too small. But we don't pass the correct min.
>> >> >> 
>> >> >> First we may encounter a pivot[split] smaller than min. For example:
>> >> >> 
>> >> >>        mt_init_flags(mt, 0);
>> >> >>        for (count = 0; count <= 240; count++) {
>> >> >>                mas_set(&mas, count);
>> >> >>                mas_store(&mas, xa_mk_value(count));
>> >> >>        }
>> >> >> 
>> >> >> On splitting root for storing 240, the pivot[split] is smaller than min.
>> >> >> This result a huge (pivot[split] - min).
>> >> >
>> >> >This is fine.
>> >> >
>> >> >There is an open work item to make it more accurate at higher levels,
>> >> >but it's not a problem as it is.
>> >> >
>> >> >Each level upwards needs a better 'minimum span', meaning that the node
>> >> >should have at least mas.min - mas.min + level * something.
>> >> 
>> >> Don't follow this. 
>> >> 
>> >> First, I guess it is mas.max - mas.min?
>> >> 
>> >
>> >No, the idea is that we could do better if we could keep the data more
>> >densely packed.
>> >
>> >If we have singletons (all range of 1), and we have 10,000 singletons,
>> >then a parent node with 0 - 128 and a sibling of 129-257 makes no sense,
>> >there is nothing that can be inserting into the parents last two slots
>> >- we can't split singletons smaller so we should have done a better job
>> >at splitting.
>> >
>> >If you think of how this grows today, we split then rebalance until the
>> >left node is full and we have to split again.  This could be better
>> >optimised to avoid multiple rebalances by doing something to push more
>> >data to the left based on the span of the children and the nodes.  But
>> >we'd want to keep enough so some modifications to each node doesn't
>> >cause a rebalance in the opposite direction.
>> >
>> >So a minimum span of parent nodes one level up from the leaves would be
>> >16 * 10 = 160, so when we split we can push more to the left to try and
>> >get to 160 while keeping the right node sufficient.  The math of the
>> >span would change based on how far up the tree we go, but would be
>> >easier to maintain.
>> >
>> >I'd probably still want to split it 50/50 the first time, but it might
>> >make sense to push more on the next rebalance.  so 10 => 6|5, 6|10 =>
>> >9|5, but maybe it's not worth the effort.  I'm starting to think so,
>> >considering it's been ineffective to date.  There is a chance that the
>> >active area gets moved to the more-full node and ends up splitting
>> >sooner anyways..
>> >
>> >All of this will probably change when dense nodes come anyways.
>> >
>> >> Then there is sth related to level. But I don't see level is used for
>> >> calculation. Would you mind sharing more your original thought?
>> >> 
>> >> >
>> >> >It works today for leaves, somewhat.
>> >> >
>> >> 
>> >> To me it works by coincidence.
>> >
>> >Seems like it.
>> >
>> >> 
>> >> The condition check is:
>> >> 
>> >>   (bn->pivot[split] - min) < (slot_count - 1) 
>> >> 
>> >> while slot_count is a constent, 16 for leaf node.
>> >> 
>> >> And we always pass 0 as min, the condition for leaf is:
>> >> 
>> >>   bn->pivot[split] < 15
>> >> 
>> >> For the mmap world, per my understanding, we won't expect an address is smaller
>> >> than 15.
>> >
>> >There are other users, but I think we should just drop this attempt at
>> >fancy stuff.
>> >
>> 
>> You mean "set mid_split = 0 and drop the min argument all together" mentioned
>> in another mail?
>
>Yes, please.  The more I think about this, the more I think we're
>wasting time trying to predict the future.  And, although fun, it
>usually doesn't work out.
>

Agree, will prepare another version with this.

Thanks
diff mbox series

Patch

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 894dc5e9436e..c2d4b188646c 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -3357,7 +3357,7 @@  static void mas_split(struct ma_state *mas, struct maple_big_node *b_node)
 		if (mas_push_data(mas, height, &mast, false))
 			break;
 
-		split = mab_calc_split(mas, b_node, &mid_split, prev_l_mas.min);
+		split = mab_calc_split(mas, b_node, &mid_split, mas->min);
 		mast_split_data(&mast, mas, split);
 		/*
 		 * Usually correct, mab_mas_cp in the above call overwrites