Message ID | 20230720102337.2069722-1-jaypatel@linux.ibm.com (mailing list archive)
State      | New
Series     | [RFC,v4] mm/slub: Optimize slub memory usage
On Thu, Jul 20, 2023 at 7:24 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
>
> In the current implementation of the slub memory allocator, the slab
> order selection process follows these criteria:
>
> 1) Determine the minimum order required to serve the minimum number of
>    objects (min_objects). This calculation is based on the formula
>    (order = min_objects * object_size / PAGE_SIZE).
> 2) If the minimum order is greater than the maximum allowed order
>    (slub_max_order), set slub_max_order as the order for this slab.
> 3) If the minimum order is less than slub_max_order, iterate from the
>    minimum order to slub_max_order and check whether the condition
>    (rem <= slab_size / fract_leftover) holds. Here, slab_size is
>    (PAGE_SIZE << order), rem is (slab_size % object_size), and
>    fract_leftover can take the values 16, 8, or 4. If the condition
>    holds, select that order for the slab.
>
> However, in point 3, the permitted leftover (slab_size / fract_leftover)
> can span a large range (1 KB down to 256 bytes at order 0 on a 4K page
> size, and 16 KB down to 4 KB at order 0 on a 64K page size, growing with
> higher orders) when compared against the remainder (rem). This can lead
> to the selection of an order that wastes more memory. To mitigate such
> wastage, point 3 is modified to scale fract_leftover with the page size,
> while retaining the current value as the default for a 4K page size.
> Test results are as follows:
>
> 1) On 160 CPUs with 64K page size
>
> +----------------+----------------+----------------+
> |          Total wastage in slub memory            |
> +----------------+----------------+----------------+
> |                | After Boot     | After Hackbench|
> | Normal         | 932 Kb         | 1812 Kb        |
> | With Patch     | 729 Kb         | 1636 Kb        |
> | Wastage reduce | ~22%           | ~10%           |
> +----------------+----------------+----------------+
>
> +----------------+----------------+----------------+
> |               Total slub memory                  |
> +----------------+----------------+----------------+
> |                | After Boot     | After Hackbench|
> | Normal         | 1855296        | 2944576        |
> | With Patch     | 1544576        | 2692032        |
> | Memory reduce  | ~17%           | ~9%            |
> +----------------+----------------+----------------+
>
> hackbench-process-sockets
> +-------+-----+----------+----------+-----------+
> |Metric | Grp | Normal   | Patched  |  (Gain)   |
> +-------+-----+----------+----------+-----------+
> | Amean |   1 |   1.2727 |   1.2450 | (  2.22%) |
> | Amean |   4 |   1.6063 |   1.5810 | (  1.60%) |
> | Amean |   7 |   2.4190 |   2.3983 | (  0.86%) |
> | Amean |  12 |   3.9730 |   3.9347 | (  0.97%) |
> | Amean |  21 |   6.9823 |   6.8957 | (  1.26%) |
> | Amean |  30 |  10.1867 |  10.0600 | (  1.26%) |
> | Amean |  48 |  16.7490 |  16.4853 | (  1.60%) |
> | Amean |  79 |  28.1870 |  27.8673 | (  1.15%) |
> | Amean | 110 |  39.8363 |  39.3793 | (  1.16%) |
> | Amean | 141 |  51.5277 |  51.4907 | (  0.07%) |
> | Amean | 172 |  62.9700 |  62.7300 | (  0.38%) |
> | Amean | 203 |  74.5037 |  74.0630 | (  0.59%) |
> | Amean | 234 |  85.6560 |  85.3587 | (  0.35%) |
> | Amean | 265 |  96.9883 |  96.3770 | (  0.63%) |
> | Amean | 296 | 108.6893 | 108.0870 | (  0.56%) |
> +-------+-----+----------+----------+-----------+
>
> 2) On 16 CPUs with 64K page size
>
> +----------------+----------------+----------------+
> |          Total wastage in slub memory            |
> +----------------+----------------+----------------+
> |                | After Boot     | After Hackbench|
> | Normal         | 273 Kb         | 544 Kb         |
> | With Patch     | 260 Kb         | 500 Kb         |
> | Wastage reduce | ~5%            | ~9%            |
> +----------------+----------------+----------------+
>
> +----------------+----------------+----------------+
> |               Total slub memory                  |
> +----------------+----------------+----------------+
> |                | After Boot     | After Hackbench|
> | Normal         | 275840         | 412480         |
> | With Patch     | 272768         | 406208         |
> | Memory reduce  | ~1%            | ~2%            |
> +----------------+----------------+----------------+
>
> hackbench-process-sockets
> +-------+----+---------+---------+-----------+
> |Metric | Grp| Normal  | Patched |  (Gain)   |
> +-------+----+---------+---------+-----------+
> | Amean |  1 |  0.9513 |  0.9250 | (  2.77%) |
> | Amean |  4 |  2.9630 |  2.9570 | (  0.20%) |
> | Amean |  7 |  5.1780 |  5.1763 | (  0.03%) |
> | Amean | 12 |  8.8833 |  8.8817 | (  0.02%) |
> | Amean | 21 | 15.7577 | 15.6883 | (  0.44%) |
> | Amean | 30 | 22.2063 | 22.2843 | ( -0.35%) |
> | Amean | 48 | 36.0587 | 36.1390 | ( -0.22%) |
> | Amean | 64 | 49.7803 | 49.3457 | (  0.87%) |
> +-------+----+---------+---------+-----------+
>
> Signed-off-by: Jay Patel <jaypatel@linux.ibm.com>
> ---
> Changes from V3
> 1) Resolved error and optimised logic for all arches
>
> Changes from V2
> 1) Removed all page order selection logic for slab caches based on
>    wastage.
> 2) Increased fraction size based on page size (keeping the current
>    value as the default for 4K pages)
>
> Changes from V1
> 1) If min_objects * object_size > PAGE_ALLOC_COSTLY_ORDER, then it
>    will return with PAGE_ALLOC_COSTLY_ORDER.
> 2) Similarly, if min_objects * object_size < PAGE_SIZE, then it will
>    return with slub_min_order.
> 3) Additionally, I changed slub_max_order to 2. There is no specific
>    reason for using the value 2, but it provided the best results in
>    terms of performance without any noticeable impact.
> mm/slub.c | 17 +++++++----------
> 1 file changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index c87628cd8a9a..8f6f38083b94 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -287,6 +287,7 @@ static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
>  #define OO_SHIFT	16
>  #define OO_MASK		((1 << OO_SHIFT) - 1)
>  #define MAX_OBJS_PER_PAGE	32767 /* since slab.objects is u15 */
> +#define SLUB_PAGE_FRAC_SHIFT	12
>
>  /* Internal SLUB flags */
>  /* Poison object */
> @@ -4117,6 +4118,7 @@ static inline int calculate_order(unsigned int size)
>  	unsigned int min_objects;
>  	unsigned int max_objects;
>  	unsigned int nr_cpus;
> +	unsigned int page_size_frac;
>
>  	/*
>  	 * Attempt to find best configuration for a slab. This
> @@ -4145,10 +4147,13 @@ static inline int calculate_order(unsigned int size)
>  	max_objects = order_objects(slub_max_order, size);
>  	min_objects = min(min_objects, max_objects);
>
> -	while (min_objects > 1) {
> +	page_size_frac = ((PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT) == 1) ? 0
> +				: PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT;
> +
> +	while (min_objects >= 1) {
>  		unsigned int fraction;
>
> -		fraction = 16;
> +		fraction = 16 + page_size_frac;
>  		while (fraction >= 4) {

Sorry I'm a bit late for the review.

IIRC hexagon/powerpc can have ridiculously large page sizes (1MB or
256KB) (but I don't know if such a config is actually used, tbh), so I
think there should be an upper bound.

>  			order = calc_slab_order(size, min_objects,
>  					slub_max_order, fraction);
> @@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned int size)
>  		min_objects--;
>  	}
>
> -	/*
> -	 * We were unable to place multiple objects in a slab. Now
> -	 * lets see if we can place a single object there.
> -	 */
> -	order = calc_slab_order(size, 1, slub_max_order, 1);
> -	if (order <= slub_max_order)
> -		return order;

I'm not sure it's okay to remove this? It was fine in v2 because the
least wasteful order was chosen regardless of fraction, but that's not
true anymore.

Otherwise, everything looks fine to me. I'm too dumb to anticipate the
outcome of increasing the slab order :P but this patch does not sound
crazy to me.

Thanks!
--
Hyeonggon
On Fri, 2023-08-11 at 02:54 +0900, Hyeonggon Yoo wrote:
> On Thu, Jul 20, 2023 at 7:24 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
> > In the current implementation of the slub memory allocator, the slab
> > order selection process follows these criteria:
[...]
> > -		fraction = 16;
> > +		fraction = 16 + page_size_frac;
> >  		while (fraction >= 4) {
>
> Sorry I'm a bit late for the review.
>
> IIRC hexagon/powerpc can have ridiculously large page sizes (1MB or
> 256KB) (but I don't know if such a config is actually used, tbh), so I
> think there should be an upper bound.

Hi,
I think that might not be required, as an arch with a larger page size
will require a larger fraction value because of the exit condition
(rem <= slab_size / fract_leftover) in calc_slab_order().

> >  			order = calc_slab_order(size, min_objects,
> >  					slub_max_order, fraction);
> > @@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned int size)
> >  		min_objects--;
> >  	}
> >
> > -	/*
> > -	 * We were unable to place multiple objects in a slab. Now
> > -	 * lets see if we can place a single object there.
> > -	 */
> > -	order = calc_slab_order(size, 1, slub_max_order, 1);
> > -	if (order <= slub_max_order)
> > -		return order;
>
> I'm not sure it's okay to remove this?
> It was fine in v2 because the least wasteful order was chosen
> regardless of fraction, but that's not true anymore.

Ok, so my thought is: if a single object in a slab with slab_size =
PAGE_SIZE << slub_max_order wastes more than 1/4th of slab_size, then
it's better to skip this part and use MAX_ORDER instead of
slub_max_order. Could you kindly share your perspective on this part?

Thanks
Jay Patel

> Otherwise, everything looks fine to me. I'm too dumb to anticipate
> the outcome of increasing the slab order :P but this patch does not
> sound crazy to me.
>
> Thanks!
> --
> Hyeonggon
On 8/10/23 19:54, Hyeonggon Yoo wrote:
>>  			order = calc_slab_order(size, min_objects,
>>  					slub_max_order, fraction);
>> @@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned int size)
>>  		min_objects--;
>>  	}
>>
>> -	/*
>> -	 * We were unable to place multiple objects in a slab. Now
>> -	 * lets see if we can place a single object there.
>> -	 */
>> -	order = calc_slab_order(size, 1, slub_max_order, 1);
>> -	if (order <= slub_max_order)
>> -		return order;
>
> I'm not sure if it's okay to remove this?
> It was fine in v2 because the least wasteful order was chosen
> regardless of fraction, but that's not true anymore.
>
> Otherwise, everything looks fine to me. I'm too dumb to anticipate
> the outcome of increasing the slab order :P but this patch does not
> sound crazy to me.

I wanted to get a better idea of how the orders change, so I hacked up
a patch to print them for all sizes up to 1MB (unnecessarily large I
guess) and also for various page sizes and nr_cpus (that's however
rather invasive and prone to me missing some helper that still relies
on the real PAGE_SHIFT), then I applied v4 on top (it needed some
conflict fixups with my hack):

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slab-orders

As expected, things didn't change with a 4k PAGE_SIZE. With a 64k
PAGE_SIZE, I thought the patch in its v4 form would result in lower
orders, but seemingly not always. I.e. I can see before the patch:

Calculated slab orders for page_shift 16 nr_cpus 1:
         8       0
      4376       1

(so until 4368 bytes it keeps order at 0)

And after:

         8       0
      2264       1
      2272       0
      2344       1
      2352       0
      2432       1

I'm not sure this kind of "oscillation" is helpful on a small machine
(1 CPU) with 64kB pages, where the unused part of the page is quite
small. With 16 CPUs, AFAICS the orders are also larger for some sizes.
Hm, but you reported a reduction of total slub memory, which suggests
lower orders were selected somewhere, so maybe I made some mistake.

Anyway, my point here is that this evaluation approach might be useful,
even if it's a non-upstreamable hack, and some postprocessing of the
output is needed for easier comparison of before/after, so feel free to
try that out.

BTW I'll be away for 2 weeks from now, so further feedback will have to
come from others in that time...

> Thanks!
> --
> Hyeonggon
On Fri, Aug 11, 2023 at 3:52 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
> On Fri, 2023-08-11 at 02:54 +0900, Hyeonggon Yoo wrote:
> > On Thu, Jul 20, 2023 at 7:24 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
> > > In the current implementation of the slub memory allocator, the
> > > slab order selection process follows these criteria:
[...]
> > IIRC hexagon/powerpc can have ridiculously large page sizes (1MB or
> > 256KB) (but I don't know if such a config is actually used, tbh), so
> > I think there should be an upper bound.
>
> Hi,
> I think that might not be required, as an arch with a larger page size
> will require a larger fraction value because of the exit condition
> (rem <= slab_size / fract_leftover) in calc_slab_order().

Okay, with 256KB pages the fraction will start from 80, and then 40,
20, 10, 5, ... and 1/80 of 256KB is about 3KB.
So it's to waste less even when the machine uses large page sizes,
because 1/16 of 256KB is still large, right?

> > > -	/*
> > > -	 * We were unable to place multiple objects in a slab. Now
> > > -	 * lets see if we can place a single object there.
> > > -	 */
> > > -	order = calc_slab_order(size, 1, slub_max_order, 1);
> > > -	if (order <= slub_max_order)
> > > -		return order;
> >
> > I'm not sure it's okay to remove this?
> > It was fine in v2 because the least wasteful order was chosen
> > regardless of fraction, but that's not true anymore.
>
> Ok, so my thought is: if a single object in a slab with slab_size =
> PAGE_SIZE << slub_max_order wastes more than 1/4th of slab_size, then
> it's better to skip this part and use MAX_ORDER instead of
> slub_max_order. Could you kindly share your perspective on this part?

I simply missed that part! :)
That looks fine to me.
On Fri, 2023-08-18 at 14:11 +0900, Hyeonggon Yoo wrote:
> On Fri, Aug 11, 2023 at 3:52 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
> > On Fri, 2023-08-11 at 02:54 +0900, Hyeonggon Yoo wrote:
> > > On Thu, Jul 20, 2023 at 7:24 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
[...]
> > > IIRC hexagon/powerpc can have ridiculously large page sizes (1MB
> > > or 256KB) (but I don't know if such a config is actually used,
> > > tbh), so I think there should be an upper bound.
> >
> > Hi,
> > I think that might not be required, as an arch with a larger page
> > size will require a larger fraction value because of the exit
> > condition (rem <= slab_size / fract_leftover) in calc_slab_order().
>
> Okay, with 256KB pages the fraction will start from 80, and then 40,
> 20, 10, 5, ... and 1/80 of 256KB is about 3KB.
> So it's to waste less even when the machine uses large page sizes,
> because 1/16 of 256KB is still large, right?

Yes, correct. With this approach we can reduce both the wastage and
the total slub memory when using larger page sizes. :)

> > > > -	/*
> > > > -	 * We were unable to place multiple objects in a slab. Now
> > > > -	 * lets see if we can place a single object there.
> > > > -	 */
> > > > -	order = calc_slab_order(size, 1, slub_max_order, 1);
> > > > -	if (order <= slub_max_order)
> > > > -		return order;
> > >
> > > I'm not sure it's okay to remove this?
> > > It was fine in v2 because the least wasteful order was chosen
> > > regardless of fraction, but that's not true anymore.
> > > > > Ok, So my though are like if single object in slab with slab_size = > > PAGE_SIZE << slub_max_order and it wastage more then 1\4th of > > slab_size > > then it's better to skip this part and use MAX_ORDER instead of > > slub_max_order. > > Could you kindly share your perspective on this part? > > I simply missed that part! :) > That looks fine to me. > > > > Tha > > nks > > Jay Patel > > > Otherwise, everything looks fine to me. I'm too dumb to > > > anticipate > > > the outcome of increasing the slab order :P but this patch does > > > not > > > sound crazy to me. > > > > > > Thanks! > > > -- > > > Hyeonggon
On Fri, 2023-08-11 at 17:43 +0200, Vlastimil Babka wrote:
> On 8/10/23 19:54, Hyeonggon Yoo wrote:
> > >  			order = calc_slab_order(size, min_objects,
> > >  					slub_max_order, fraction);
> > > @@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned int size)
> > >  		min_objects--;
> > >  	}
> > > -	/*
> > > -	 * We were unable to place multiple objects in a slab. Now
> > > -	 * lets see if we can place a single object there.
> > > -	 */
> > > -	order = calc_slab_order(size, 1, slub_max_order, 1);
> > > -	if (order <= slub_max_order)
> > > -		return order;
> >
> > I'm not sure if it's okay to remove this?
> > It was fine in v2 because the least wasteful order was chosen
> > regardless of fraction but that's not true anymore.
> >
> > Otherwise, everything looks fine to me. I'm too dumb to anticipate
> > the outcome of increasing the slab order :P but this patch does not
> > sound crazy to me.
>
> I wanted to have a better idea how the orders change so I hacked up a
> patch to print them for all sizes up to 1MB (unnecessarily large I
> guess) and also for various page sizes and nr_cpus (that's however
> rather invasive and prone to me missing some helper being used that
> still relies on real PAGE_SHIFT), then I applied v4 (needed some
> conflict fixups with my hack) on top:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slab-orders
>
> As expected, things didn't change with 4k PAGE_SIZE. With 64k
> PAGE_SIZE, I thought the patch in v4 form would result in lower
> orders, but seems not always?
>
> I.e. I can see before the patch:
>
> Calculated slab orders for page_shift 16 nr_cpus 1:
>    8 0
> 4376 1
>
> (so until 4368 bytes it keeps order at 0)
>
> And after:
>
>    8 0
> 2264 1
> 2272 0
> 2344 1
> 2352 0
> 2432 1
>
> Not sure this kind of "oscillation" is helpful with a small machine
> (1 CPU), and 64kB pages so the unused part of the page is quite
> small.

Hi Vlastimil,

With the patch, the fraction_size rises to 32 when utilizing a 64K
page size. As a result, the maximum wastage cap for each slab cache
will be 2K (64K divided by 32). Any object size exceeding this cap
will be moved to order 1 or beyond, due to which this oscillation is
seen.

> With 16 cpus, AFAICS the orders are also larger for some sizes.
> Hm but you reported reduction of total slab memory which suggests
> lower orders were selected somewhere, so maybe I did some mistake.

AFAIK total slab memory is reduced for two reasons (with this patch,
for larger page sizes):
1) the order for some slab caches is reduced (by increasing
fraction_size)
2) I have also seen a reduction in the overall number of slab caches
because of the increased page order

> Anyway my point here is that this evaluation approach might be
> useful, even if it's a non-upstreamable hack, and some postprocessing
> of the output is needed for easier comparison of before/after, so
> feel free to try that out.

Thank you for this detailed test :)

> BTW I'll be away for 2 weeks from now, so further feedback will have
> to come from others in that time...

Do we have any additional feedback from others on the same matter?

Thanks,
Jay Patel

> > Thanks!
> > --
> > Hyeonggon
On 8/24/23 12:52, Jay Patel wrote:
> On Fri, 2023-08-11 at 17:43 +0200, Vlastimil Babka wrote:
>> [...]
>> Not sure this kind of "oscillation" is helpful with a small machine
>> (1 CPU), and 64kB pages so the unused part of the page is quite
>> small.
>
> Hi Vlastimil,
>
> With the patch, the fraction_size rises to 32 when utilizing a 64K
> page size. As a result, the maximum wastage cap for each slab cache
> will be 2K (64K divided by 32). Any object size exceeding this cap
> will be moved to order 1 or beyond, due to which this oscillation is
> seen.

Hi, sorry for the late reply.

>> With 16 cpus, AFAICS the orders are also larger for some sizes.
>> Hm but you reported reduction of total slab memory which suggests
>> lower orders were selected somewhere, so maybe I did some mistake.
>
> AFAIK total slab memory is reduced for two reasons (with this patch,
> for larger page sizes):
> 1) the order for some slab caches is reduced (by increasing
> fraction_size)

How can increased fraction_size ever result in a lower order? I think
it can only result in increased order (or same order). And the
simulations with my hack patch don't seem to counterexample that. Note
previously I did expect the order to be lower (or same) and was
surprised by my results, but now I realized I misunderstood the v4
patch.

> 2) I have also seen a reduction in the overall number of slab caches
> because of the increased page order

I think your results might be just due to randomness and could turn
out different with repeating the test, or converge to be the same if
you average multiple runs. You posted them for "160 CPUs with 64K Page
size" and if I add that combination to my hack print, I see the same
result before and after your patch:

Calculated slab orders for page_shift 16 nr_cpus 160:
     8 0
  1824 1
  3648 2
  7288 3
174768 2
196608 3
524296 4

Still, I might have a bug there. Can you confirm there are actual
differences in /proc/slabinfo before/after your patch? If there are
none, any differences observed have to be due to randomness, not
differences in order.

Going back to the idea behind your patch, I don't think it makes sense
to try to increase the fraction only for higher orders. Yes, with a
1/16 fraction, the waste with a 64kB page can be 4kB, while with 1/32
it will be just 2kB, and with 4kB pages this is only 256 vs 128 bytes.
However, the object sizes and counts don't differ with page size, so
with 4kB pages we'll have more slabs to host the same number of
objects, and the waste will accumulate accordingly - i.e. the fraction
metric should be independent of page size wrt the resulting total
kilobytes of waste.

So maybe the only thing we need to do is to try setting the initial
value to 32 instead of 16 regardless of page size. That should
hopefully again show a good tradeoff for 4kB as one of the earlier
versions did, while on 64kB it shouldn't cause much difference (again,
none at all with 160 cpus, some difference with less than 128 cpus, if
my simulations were correct).

> Do we have any additional feedback from others on the same matter?
>
> Thanks,
> Jay Patel
On Thu, 2023-09-07 at 15:42 +0200, Vlastimil Babka wrote:
> On 8/24/23 12:52, Jay Patel wrote:
> > On Fri, 2023-08-11 at 17:43 +0200, Vlastimil Babka wrote:
> > > [...]
> > > With 16 cpus, AFAICS the orders are also larger for some sizes.
> > > Hm but you reported reduction of total slab memory which suggests
> > > lower orders were selected somewhere, so maybe I did some
> > > mistake.
> >
> > AFAIK total slab memory is reduced for two reasons (with this
> > patch, for larger page sizes):
> > 1) the order for some slab caches is reduced (by increasing
> > fraction_size)
>
> How can increased fraction_size ever result in a lower order? I think
> it can only result in increased order (or same order). And the
> simulations with my hack patch don't seem to counterexample that.
> Note previously I did expect the order to be lower (or same) and was
> surprised by my results, but now I realized I misunderstood the v4
> patch.

Hi, sorry for the late reply; I was on vacation :)

You're absolutely right: increasing the fraction size won't reduce the
order, and I apologize for any confusion in my previous response.
> > 2) I have also seen a reduction in the overall number of slab
> > caches because of the increased page order
>
> I think your results might be just due to randomness and could turn
> out different with repeating the test, or converge to be the same if
> you average multiple runs. You posted them for "160 CPUs with 64K
> Page size" and if I add that combination to my hack print, I see the
> same result before and after your patch:
>
> Calculated slab orders for page_shift 16 nr_cpus 160:
>      8 0
>   1824 1
>   3648 2
>   7288 3
> 174768 2
> 196608 3
> 524296 4
>
> Still, I might have a bug there. Can you confirm there are actual
> differences in /proc/slabinfo before/after your patch? If there are
> none, any differences observed have to be due to randomness, not
> differences in order.

Indeed, to eliminate randomness, I've consistently gathered data from
/proc/slabinfo, and I can confirm a decrease in the total number of
slab caches.

Values on a 160 CPU system with 64K page size:
Without patch: 24892 slab caches
With patch:    23891 slab caches

> Going back to the idea behind your patch, I don't think it makes
> sense to try to increase the fraction only for higher orders. Yes,
> with a 1/16 fraction, the waste with a 64kB page can be 4kB, while
> with 1/32 it will be just 2kB, and with 4kB pages this is only 256 vs
> 128 bytes. However, the object sizes and counts don't differ with
> page size, so with 4kB pages we'll have more slabs to host the same
> number of objects, and the waste will accumulate accordingly - i.e.
> the fraction metric should be independent of page size wrt the
> resulting total kilobytes of waste.
>
> So maybe the only thing we need to do is to try setting the initial
> value to 32 instead of 16 regardless of page size. That should
> hopefully again show a good tradeoff for 4kB as one of the earlier
> versions did, while on 64kB it shouldn't cause much difference
> (again, none at all with 160 cpus, some difference with less than 128
> cpus, if my simulations were correct).

Yes, we can modify the default fraction size to 32 for all page sizes.
I've noticed that on a 160 CPU system with a 64K page size, there's a
noticeable change in the total memory allocated for slabs - it
decreases.

Alright, I'll make the necessary changes to the patch, setting the
fraction size default to 32, and I'll post v5 along with some
performance metrics.

> > [...]
> > Do we have any additional feedback from others on the same matter?
> >
> > Thanks,
> > Jay Patel
On 9/14/23 07:40, Jay Patel wrote:
> On Thu, 2023-09-07 at 15:42 +0200, Vlastimil Babka wrote:
>> On 8/24/23 12:52, Jay Patel wrote:
>> [...]
>> How can increased fraction_size ever result in a lower order? I
>> think it can only result in increased order (or same order). And the
>> simulations with my hack patch don't seem to counterexample that.
>> Note previously I did expect the order to be lower (or same) and was
>> surprised by my results, but now I realized I misunderstood the v4
>> patch.
>
> Hi, sorry for the late reply; I was on vacation :)
>
> You're absolutely right: increasing the fraction size won't reduce
> the order, and I apologize for any confusion in my previous response.

No problem, glad that it's cleared :)

> Indeed, to eliminate randomness, I've consistently gathered data from
> /proc/slabinfo, and I can confirm a decrease in the total number of
> slab caches.
>
> Values on a 160 CPU system with 64K page size:
> Without patch: 24892 slab caches
> With patch:    23891 slab caches

I would like to see why exactly they decreased: given what the patch
does, it has to be due to getting higher-order slab pages. So the
values of the "<objperslab> <pagesperslab>" columns should increase
for some caches - which ones, and what is their <objsize>?

> Yes, we can modify the default fraction size to 32 for all page
> sizes. [...]
>
> Alright, I'll make the necessary changes to the patch, setting the
> fraction size default to 32, and I'll post v5 along with some
> performance metrics.

Could you please also check my cleanup series at

https://lore.kernel.org/all/20230908145302.30320-6-vbabka@suse.cz/

(I did Cc you there). If it makes sense, I'd like to apply the further
optimization on top of those cleanups, not the other way around.

Thanks!
On Thu, 2023-09-14 at 08:38 +0200, Vlastimil Babka wrote:
> On 9/14/23 07:40, Jay Patel wrote:
> > [...]
> > Values on a 160 CPU system with 64K page size:
> > Without patch: 24892 slab caches
> > With patch:    23891 slab caches
>
> I would like to see why exactly they decreased: given what the patch
> does, it has to be due to getting higher-order slab pages. So the
> values of the "<objperslab> <pagesperslab>" columns should increase
> for some caches - which ones, and what is their <objsize>?

Yes, correct: an increase in page order for a slab cache will result
in increased "<objperslab> <pagesperslab>" values. I only checked the
total number of slab caches, so let me check these values in detail
and get back with the <objsize> :)

> Could you please also check my cleanup series at
>
> https://lore.kernel.org/all/20230908145302.30320-6-vbabka@suse.cz/
>
> (I did Cc you there). If it makes sense, I'd like to apply the
> further optimization on top of those cleanups, not the other way
> around.
>
> Thanks!

I've just gone through that patch series, and yes, we can adjust the
fraction-size-related change within that series :)
diff --git a/mm/slub.c b/mm/slub.c
index c87628cd8a9a..8f6f38083b94 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -287,6 +287,7 @@ static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
 #define OO_SHIFT	16
 #define OO_MASK		((1 << OO_SHIFT) - 1)
 #define MAX_OBJS_PER_PAGE	32767 /* since slab.objects is u15 */
+#define SLUB_PAGE_FRAC_SHIFT	12
 
 /* Internal SLUB flags */
 /* Poison object */
@@ -4117,6 +4118,7 @@ static inline int calculate_order(unsigned int size)
 	unsigned int min_objects;
 	unsigned int max_objects;
 	unsigned int nr_cpus;
+	unsigned int page_size_frac;
 
 	/*
 	 * Attempt to find best configuration for a slab. This
@@ -4145,10 +4147,13 @@ static inline int calculate_order(unsigned int size)
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);
 
-	while (min_objects > 1) {
+	page_size_frac = ((PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT) == 1) ? 0
+			: PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT;
+
+	while (min_objects >= 1) {
 		unsigned int fraction;
 
-		fraction = 16;
+		fraction = 16 + page_size_frac;
 		while (fraction >= 4) {
 			order = calc_slab_order(size, min_objects,
 					slub_max_order, fraction);
@@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned int size)
 		min_objects--;
 	}
 
-	/*
-	 * We were unable to place multiple objects in a slab. Now
-	 * lets see if we can place a single object there.
-	 */
-	order = calc_slab_order(size, 1, slub_max_order, 1);
-	if (order <= slub_max_order)
-		return order;
-
 	/*
 	 * Doh this slab cannot be placed using slub_max_order.
 	 */
In the current implementation of the slub memory allocator, the slab order selection process follows these criteria: 1) Determine the minimum order required to serve the minimum number of objects (min_objects). This calculation is based on the formula (order = min_objects * object_size / PAGE_SIZE). 2) If the minimum order is greater than the maximum allowed order (slub_max_order), set slub_max_order as the order for this slab. 3) If the minimum order is less than the slub_max_order, iterate through a loop from minimum order to slub_max_order and check if the condition (rem <= slab_size / fract_leftover) holds true. Here, slab_size is calculated as (PAGE_SIZE << order), rem is (slab_size % object_size), and fract_leftover can have values of 16, 8, or 4. If the condition is true, select that order for the slab. However, in point 3, when calculating the fraction left over, it can result in a large range of values (like 1 Kb to 256 bytes on 4K page size & 4 Kb to 16 Kb on 64K page size with order 0 and goes on increasing with higher order) when compared to the remainder (rem). This can lead to the selection of an order that results in more memory wastage. To mitigate such wastage, we have modified point 3 as follows: To adjust the value of fract_leftover based on the page size, while retaining the current value as the default for a 4K page size. 
Test results are as follows:

1) On 160 CPUs with 64K Page size

+-----------------+----------------+----------------+
|           Total wastage in slub memory            |
+-----------------+----------------+----------------+
|                 | After Boot     | After Hackbench|
| Normal          | 932 KB         | 1812 KB        |
| With Patch      | 729 KB         | 1636 KB        |
| Wastage reduce  | ~22%           | ~10%           |
+-----------------+----------------+----------------+

+-----------------+----------------+----------------+
|                Total slub memory                  |
+-----------------+----------------+----------------+
|                 | After Boot     | After Hackbench|
| Normal          | 1855296        | 2944576        |
| With Patch      | 1544576        | 2692032        |
| Memory reduce   | ~17%           | ~9%            |
+-----------------+----------------+----------------+

hackbench-process-sockets
+-------+-----+----------+----------+-----------+
|       | Grp | Normal   | Patched  | Change    |
+-------+-----+----------+----------+-----------+
| Amean | 1   | 1.2727   | 1.2450   | ( 2.22%)  |
| Amean | 4   | 1.6063   | 1.5810   | ( 1.60%)  |
| Amean | 7   | 2.4190   | 2.3983   | ( 0.86%)  |
| Amean | 12  | 3.9730   | 3.9347   | ( 0.97%)  |
| Amean | 21  | 6.9823   | 6.8957   | ( 1.26%)  |
| Amean | 30  | 10.1867  | 10.0600  | ( 1.26%)  |
| Amean | 48  | 16.7490  | 16.4853  | ( 1.60%)  |
| Amean | 79  | 28.1870  | 27.8673  | ( 1.15%)  |
| Amean | 110 | 39.8363  | 39.3793  | ( 1.16%)  |
| Amean | 141 | 51.5277  | 51.4907  | ( 0.07%)  |
| Amean | 172 | 62.9700  | 62.7300  | ( 0.38%)  |
| Amean | 203 | 74.5037  | 74.0630  | ( 0.59%)  |
| Amean | 234 | 85.6560  | 85.3587  | ( 0.35%)  |
| Amean | 265 | 96.9883  | 96.3770  | ( 0.63%)  |
| Amean | 296 | 108.6893 | 108.0870 | ( 0.56%)  |
+-------+-----+----------+----------+-----------+

2) On 16 CPUs with 64K Page size

+----------------+----------------+----------------+
|           Total wastage in slub memory           |
+----------------+----------------+----------------+
|                | After Boot     | After Hackbench|
| Normal         | 273 KB         | 544 KB         |
| With Patch     | 260 KB         | 500 KB         |
| Wastage reduce | ~5%            | ~9%            |
+----------------+----------------+----------------+

+-----------------+----------------+----------------+
|                Total slub memory                  |
+-----------------+----------------+----------------+
|                 | After Boot     | After Hackbench|
| Normal          | 275840         | 412480         |
| With Patch      | 272768         | 406208         |
| Memory reduce   | ~1%            | ~2%            |
+-----------------+----------------+----------------+

hackbench-process-sockets
+-------+----+---------+---------+-----------+
|       | Grp| Normal  | Patched | Change    |
+-------+----+---------+---------+-----------+
| Amean | 1  | 0.9513  | 0.9250  | ( 2.77%)  |
| Amean | 4  | 2.9630  | 2.9570  | ( 0.20%)  |
| Amean | 7  | 5.1780  | 5.1763  | ( 0.03%)  |
| Amean | 12 | 8.8833  | 8.8817  | ( 0.02%)  |
| Amean | 21 | 15.7577 | 15.6883 | ( 0.44%)  |
| Amean | 30 | 22.2063 | 22.2843 | ( -0.35%) |
| Amean | 48 | 36.0587 | 36.1390 | ( -0.22%) |
| Amean | 64 | 49.7803 | 49.3457 | ( 0.87%)  |
+-------+----+---------+---------+-----------+

Signed-off-by: Jay Patel <jaypatel@linux.ibm.com>
---
Changes from V3
1) Resolved the error and optimized the logic for all architectures.

Changes from V2
1) Removed all page order selection logic for slab caches based on wastage.
2) Increased the fraction size based on page size (keeping the current
value as the default for a 4K page size).

Changes from V1
1) If min_objects * object_size > PAGE_ALLOC_COSTLY_ORDER, then it will
return with PAGE_ALLOC_COSTLY_ORDER.
2) Similarly, if min_objects * object_size < PAGE_SIZE, then it will
return with slub_min_order.
3) Additionally, changed slub_max_order to 2. There is no specific
reason for the value 2, but it provided the best results in terms of
performance without any noticeable impact.

 mm/slub.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)