Message ID | 20220630073304.26945-1-feng.tang@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | iommu/iova: change IOVA_MAG_SIZE to 127 to save memory | expand |
On 2022-06-30 08:33, Feng Tang wrote: > kmalloc will round up the request size to power of 2, and current > iova_magazine's size is 1032 (1024+8) bytes, so each instance > allocated will get 2048 bytes from kmalloc, causing around 1KB > waste. > > And in some exstreme case, the memory wasted can trigger OOM as > reported in 2019 on a crash kernel with 256 MB memory [1]. I don't think it really needs pointing out that excessive memory consumption can cause OOM. Especially not in the particularly silly context of a system with only 2MB of RAM per CPU - that's pretty much guaranteed to be doomed one way or another. > [ 4.319253] iommu: Adding device 0000:06:00.2 to group 5 > [ 4.325869] iommu: Adding device 0000:20:01.0 to group 15 > [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 > [ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0 > [ 4.350251] swapper/0 cpuset=/ mems_allowed=0 > [ 4.354618] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.57.mx64.282 #1 > [ 4.355612] Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.9.3 06/25/2019 > [ 4.355612] Call Trace: > [ 4.355612] dump_stack+0x46/0x5b > [ 4.355612] dump_header+0x6b/0x289 > [ 4.355612] out_of_memory+0x470/0x4c0 > [ 4.355612] __alloc_pages_nodemask+0x970/0x1030 > [ 4.355612] cache_grow_begin+0x7d/0x520 > [ 4.355612] fallback_alloc+0x148/0x200 > [ 4.355612] kmem_cache_alloc_trace+0xac/0x1f0 > [ 4.355612] init_iova_domain+0x112/0x170 > [ 4.355612] amd_iommu_domain_alloc+0x138/0x1a0 > [ 4.355612] iommu_group_get_for_dev+0xc4/0x1a0 > [ 4.355612] amd_iommu_add_device+0x13a/0x610 > [ 4.355612] add_iommu_group+0x20/0x30 > [ 4.355612] bus_for_each_dev+0x76/0xc0 > [ 4.355612] bus_set_iommu+0xb6/0xf0 > [ 4.355612] amd_iommu_init_api+0x112/0x132 > [ 4.355612] state_next+0xfb1/0x1165 > [ 4.355612] amd_iommu_init+0x1f/0x67 > [ 4.355612] pci_iommu_init+0x16/0x3f > ... > [ 4.670295] Unreclaimable slab info: > ... > [ 4.857565] kmalloc-2048 59164KB 59164KB > > Change IOVA_MAG_SIZE from 128 to 127 to make size of 'iova_magazine' > 1024 bytes so that no memory will be wasted. > > [1]. https://lkml.org/lkml/2019/8/12/266 > > Signed-off-by: Feng Tang <feng.tang@intel.com> > --- > drivers/iommu/iova.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c > index db77aa675145b..27634ddd9b904 100644 > --- a/drivers/iommu/iova.c > +++ b/drivers/iommu/iova.c > @@ -614,7 +614,12 @@ EXPORT_SYMBOL_GPL(reserve_iova); > * dynamic size tuning described in the paper. > */ > > -#define IOVA_MAG_SIZE 128 > +/* > + * As kmalloc's buffer size is fixed to power of 2, 127 is chosen to > + * assure size of 'iova_magzine' to be 1024 bytes, so that no memory Typo: iova_magazine > + * will be wasted. > + */ > +#define IOVA_MAG_SIZE 127 The change itself seems perfectly reasonable, though. Acked-by: Robin Murphy <robin.murphy@arm.com> > #define MAX_GLOBAL_MAGS 32 /* magazines per bin */ > > struct iova_magazine {
Hi Robin, On Thu, Jun 30, 2022 at 10:02:00AM +0100, Robin Murphy wrote: > On 2022-06-30 08:33, Feng Tang wrote: > > kmalloc will round up the request size to power of 2, and current > > iova_magazine's size is 1032 (1024+8) bytes, so each instance > > allocated will get 2048 bytes from kmalloc, causing around 1KB > > waste. > > > > And in some exstreme case, the memory wasted can trigger OOM as > > reported in 2019 on a crash kernel with 256 MB memory [1]. > > I don't think it really needs pointing out that excessive memory consumption > can cause OOM. Especially not in the particularly silly context of a system > with only 2MB of RAM per CPU - that's pretty much guaranteed to be doomed > one way or another. Yes, the 256MB for 128 CPU is kind of extreme, will remove this. > > [ 4.319253] iommu: Adding device 0000:06:00.2 to group 5 > > [ 4.325869] iommu: Adding device 0000:20:01.0 to group 15 > > [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 > > [ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0 > > [ 4.350251] swapper/0 cpuset=/ mems_allowed=0 > > [ 4.354618] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.57.mx64.282 #1 > > [ 4.355612] Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.9.3 06/25/2019 > > [ 4.355612] Call Trace: > > [ 4.355612] dump_stack+0x46/0x5b > > [ 4.355612] dump_header+0x6b/0x289 > > [ 4.355612] out_of_memory+0x470/0x4c0 > > [ 4.355612] __alloc_pages_nodemask+0x970/0x1030 > > [ 4.355612] cache_grow_begin+0x7d/0x520 > > [ 4.355612] fallback_alloc+0x148/0x200 > > [ 4.355612] kmem_cache_alloc_trace+0xac/0x1f0 > > [ 4.355612] init_iova_domain+0x112/0x170 > > [ 4.355612] amd_iommu_domain_alloc+0x138/0x1a0 > > [ 4.355612] iommu_group_get_for_dev+0xc4/0x1a0 > > [ 4.355612] amd_iommu_add_device+0x13a/0x610 > > [ 4.355612] add_iommu_group+0x20/0x30 > > [ 4.355612] bus_for_each_dev+0x76/0xc0 > > [ 4.355612] bus_set_iommu+0xb6/0xf0 > > [ 4.355612] amd_iommu_init_api+0x112/0x132 > > [ 4.355612] state_next+0xfb1/0x1165 > > [ 4.355612] amd_iommu_init+0x1f/0x67 > > [ 4.355612] pci_iommu_init+0x16/0x3f > > ... > > [ 4.670295] Unreclaimable slab info: > > ... > > [ 4.857565] kmalloc-2048 59164KB 59164KB > > > > Change IOVA_MAG_SIZE from 128 to 127 to make size of 'iova_magazine' > > 1024 bytes so that no memory will be wasted. > > > > [1]. https://lkml.org/lkml/2019/8/12/266 > > > > Signed-off-by: Feng Tang <feng.tang@intel.com> > > --- > > drivers/iommu/iova.c | 7 ++++++- > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c > > index db77aa675145b..27634ddd9b904 100644 > > --- a/drivers/iommu/iova.c > > +++ b/drivers/iommu/iova.c > > @@ -614,7 +614,12 @@ EXPORT_SYMBOL_GPL(reserve_iova); > > * dynamic size tuning described in the paper. > > */ > > -#define IOVA_MAG_SIZE 128 > > +/* > > + * As kmalloc's buffer size is fixed to power of 2, 127 is chosen to > > + * assure size of 'iova_magzine' to be 1024 bytes, so that no memory > > Typo: iova_magazine will fix. > > + * will be wasted. > > + */ > > +#define IOVA_MAG_SIZE 127 > > The change itself seems perfectly reasonable, though. > > Acked-by: Robin Murphy <robin.murphy@arm.com> Thanks for the review! - Feng > > #define MAX_GLOBAL_MAGS 32 /* magazines per bin */ > > struct iova_magazine {
On 30/06/2022 10:02, Robin Murphy wrote: > On 2022-06-30 08:33, Feng Tang wrote: >> kmalloc will round up the request size to power of 2, and current >> iova_magazine's size is 1032 (1024+8) bytes, so each instance >> allocated will get 2048 bytes from kmalloc, causing around 1KB >> waste. >> >> And in some exstreme case, the memory wasted can trigger OOM as >> reported in 2019 on a crash kernel with 256 MB memory [1]. > > I don't think it really needs pointing out that excessive memory > consumption can cause OOM. Especially not in the particularly silly > context of a system with only 2MB of RAM per CPU - that's pretty much > guaranteed to be doomed one way or another. > >> [ 4.319253] iommu: Adding device 0000:06:00.2 to group 5 >> [ 4.325869] iommu: Adding device 0000:20:01.0 to group 15 >> [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 >> [ 4.338946] swapper/0 invoked oom-killer: >> gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, >> oom_score_adj=0 >> [ 4.350251] swapper/0 cpuset=/ mems_allowed=0 >> [ 4.354618] CPU: 0 PID: 1 Comm: swapper/0 Not tainted >> 4.19.57.mx64.282 #1 >> [ 4.355612] Hardware name: Dell Inc. PowerEdge R7425/08V001, >> BIOS 1.9.3 06/25/2019 >> [ 4.355612] Call Trace: >> [ 4.355612] dump_stack+0x46/0x5b >> [ 4.355612] dump_header+0x6b/0x289 >> [ 4.355612] out_of_memory+0x470/0x4c0 >> [ 4.355612] __alloc_pages_nodemask+0x970/0x1030 >> [ 4.355612] cache_grow_begin+0x7d/0x520 >> [ 4.355612] fallback_alloc+0x148/0x200 >> [ 4.355612] kmem_cache_alloc_trace+0xac/0x1f0 >> [ 4.355612] init_iova_domain+0x112/0x170 >> [ 4.355612] amd_iommu_domain_alloc+0x138/0x1a0 >> [ 4.355612] iommu_group_get_for_dev+0xc4/0x1a0 >> [ 4.355612] amd_iommu_add_device+0x13a/0x610 >> [ 4.355612] add_iommu_group+0x20/0x30 >> [ 4.355612] bus_for_each_dev+0x76/0xc0 >> [ 4.355612] bus_set_iommu+0xb6/0xf0 >> [ 4.355612] amd_iommu_init_api+0x112/0x132 >> [ 4.355612] state_next+0xfb1/0x1165 >> [ 4.355612] amd_iommu_init+0x1f/0x67 >> [ 4.355612] pci_iommu_init+0x16/0x3f >> ... >> [ 4.670295] Unreclaimable slab info: >> ... >> [ 4.857565] kmalloc-2048 59164KB 59164KB >> >> Change IOVA_MAG_SIZE from 128 to 127 to make size of 'iova_magazine' >> 1024 bytes so that no memory will be wasted. >> >> [1]. https://lkml.org/lkml/2019/8/12/266 >> >> Signed-off-by: Feng Tang <feng.tang@intel.com> >> --- >> drivers/iommu/iova.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c >> index db77aa675145b..27634ddd9b904 100644 >> --- a/drivers/iommu/iova.c >> +++ b/drivers/iommu/iova.c >> @@ -614,7 +614,12 @@ EXPORT_SYMBOL_GPL(reserve_iova); >> * dynamic size tuning described in the paper. >> */ >> -#define IOVA_MAG_SIZE 128 >> +/* >> + * As kmalloc's buffer size is fixed to power of 2, 127 is chosen to >> + * assure size of 'iova_magzine' to be 1024 bytes, so that no memory > > Typo: iova_magazine > >> + * will be wasted. >> + */ >> +#define IOVA_MAG_SIZE 127 I do wonder if we will see some strange new behaviour since IOVA_FQ_SIZE % IOVA_MAG_SIZE != 0 now... > > The change itself seems perfectly reasonable, though. > > Acked-by: Robin Murphy <robin.murphy@arm.com>
On 2022-06-30 10:37, John Garry wrote: > On 30/06/2022 10:02, Robin Murphy wrote: >> On 2022-06-30 08:33, Feng Tang wrote: >>> kmalloc will round up the request size to power of 2, and current >>> iova_magazine's size is 1032 (1024+8) bytes, so each instance >>> allocated will get 2048 bytes from kmalloc, causing around 1KB >>> waste. >>> >>> And in some exstreme case, the memory wasted can trigger OOM as >>> reported in 2019 on a crash kernel with 256 MB memory [1]. >> >> I don't think it really needs pointing out that excessive memory >> consumption can cause OOM. Especially not in the particularly silly >> context of a system with only 2MB of RAM per CPU - that's pretty much >> guaranteed to be doomed one way or another. >> >>> [ 4.319253] iommu: Adding device 0000:06:00.2 to group 5 >>> [ 4.325869] iommu: Adding device 0000:20:01.0 to group 15 >>> [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 >>> [ 4.338946] swapper/0 invoked oom-killer: >>> gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, >>> oom_score_adj=0 >>> [ 4.350251] swapper/0 cpuset=/ mems_allowed=0 >>> [ 4.354618] CPU: 0 PID: 1 Comm: swapper/0 Not tainted >>> 4.19.57.mx64.282 #1 >>> [ 4.355612] Hardware name: Dell Inc. PowerEdge R7425/08V001, >>> BIOS 1.9.3 06/25/2019 >>> [ 4.355612] Call Trace: >>> [ 4.355612] dump_stack+0x46/0x5b >>> [ 4.355612] dump_header+0x6b/0x289 >>> [ 4.355612] out_of_memory+0x470/0x4c0 >>> [ 4.355612] __alloc_pages_nodemask+0x970/0x1030 >>> [ 4.355612] cache_grow_begin+0x7d/0x520 >>> [ 4.355612] fallback_alloc+0x148/0x200 >>> [ 4.355612] kmem_cache_alloc_trace+0xac/0x1f0 >>> [ 4.355612] init_iova_domain+0x112/0x170 >>> [ 4.355612] amd_iommu_domain_alloc+0x138/0x1a0 >>> [ 4.355612] iommu_group_get_for_dev+0xc4/0x1a0 >>> [ 4.355612] amd_iommu_add_device+0x13a/0x610 >>> [ 4.355612] add_iommu_group+0x20/0x30 >>> [ 4.355612] bus_for_each_dev+0x76/0xc0 >>> [ 4.355612] bus_set_iommu+0xb6/0xf0 >>> [ 4.355612] amd_iommu_init_api+0x112/0x132 >>> [ 4.355612] state_next+0xfb1/0x1165 >>> [ 4.355612] amd_iommu_init+0x1f/0x67 >>> [ 4.355612] pci_iommu_init+0x16/0x3f >>> ... >>> [ 4.670295] Unreclaimable slab info: >>> ... >>> [ 4.857565] kmalloc-2048 59164KB 59164KB >>> >>> Change IOVA_MAG_SIZE from 128 to 127 to make size of 'iova_magazine' >>> 1024 bytes so that no memory will be wasted. >>> >>> [1]. https://lkml.org/lkml/2019/8/12/266 >>> >>> Signed-off-by: Feng Tang <feng.tang@intel.com> >>> --- >>> drivers/iommu/iova.c | 7 ++++++- >>> 1 file changed, 6 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c >>> index db77aa675145b..27634ddd9b904 100644 >>> --- a/drivers/iommu/iova.c >>> +++ b/drivers/iommu/iova.c >>> @@ -614,7 +614,12 @@ EXPORT_SYMBOL_GPL(reserve_iova); >>> * dynamic size tuning described in the paper. >>> */ >>> -#define IOVA_MAG_SIZE 128 >>> +/* >>> + * As kmalloc's buffer size is fixed to power of 2, 127 is chosen to >>> + * assure size of 'iova_magzine' to be 1024 bytes, so that no memory >> >> Typo: iova_magazine >> >>> + * will be wasted. >>> + */ >>> +#define IOVA_MAG_SIZE 127 > > I do wonder if we will see some strange new behaviour since IOVA_FQ_SIZE > % IOVA_MAG_SIZE != 0 now... I doubt it - even if a flush queue does happen to be entirely full of equal-sized IOVAs, a CPU's loaded magazines also both being perfectly empty when it comes to dump a full fq seem further unlikely, so in practice I don't see this making any appreciable change to the likelihood of spilling back to the depot or not. In fact the smaller the magazines get, the less time would be spent flushing the depot back to the rbtree, where your interesting workload falls off the cliff and never catches back up with the fq timer, so at some point it might even improve (unless it's also already close to the point where smaller caches would bottleneck allocation)... might be interesting to experiment with a wider range of magazine sizes if you had the time and inclination. Cheers, Robin. > >> >> The change itself seems perfectly reasonable, though. >> >> Acked-by: Robin Murphy <robin.murphy@arm.com> >
>>> >>>> [ 4.319253] iommu: Adding device 0000:06:00.2 to group 5 >>>> [ 4.325869] iommu: Adding device 0000:20:01.0 to group 15 >>>> [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 >>>> [ 4.338946] swapper/0 invoked oom-killer: >>>> gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, >>>> oom_score_adj=0 >>>> [ 4.350251] swapper/0 cpuset=/ mems_allowed=0 >>>> [ 4.354618] CPU: 0 PID: 1 Comm: swapper/0 Not tainted >>>> 4.19.57.mx64.282 #1 >>>> [ 4.355612] Hardware name: Dell Inc. PowerEdge R7425/08V001, >>>> BIOS 1.9.3 06/25/2019 >>>> [ 4.355612] Call Trace: >>>> [ 4.355612] dump_stack+0x46/0x5b >>>> [ 4.355612] dump_header+0x6b/0x289 >>>> [ 4.355612] out_of_memory+0x470/0x4c0 >>>> [ 4.355612] __alloc_pages_nodemask+0x970/0x1030 >>>> [ 4.355612] cache_grow_begin+0x7d/0x520 >>>> [ 4.355612] fallback_alloc+0x148/0x200 >>>> [ 4.355612] kmem_cache_alloc_trace+0xac/0x1f0 >>>> [ 4.355612] init_iova_domain+0x112/0x170 Note for Feng Tang: This callchain does not exist anymore since we separated out the rcache init from the IOVA domain init. Indeed, not so much memory is wasted for unused rcaches now. My point really is that it would be nicer to see a modern callchain - but don't read that as me saying that the change is this patch is bad. >>>> [ 4.355612] amd_iommu_domain_alloc+0x138/0x1a0 >>>> [ 4.355612] iommu_group_get_for_dev+0xc4/0x1a0 >>>> [ 4.355612] amd_iommu_add_device+0x13a/0x610 >>>> [ 4.355612] add_iommu_group+0x20/0x30 >>>> [ 4.355612] bus_for_each_dev+0x76/0xc0 >>>> [ 4.355612] bus_set_iommu+0xb6/0xf0 >>>> [ 4.355612] amd_iommu_init_api+0x112/0x132 >>>> [ 4.355612] state_next+0xfb1/0x1165 >>>> [ 4.355612] amd_iommu_init+0x1f/0x67 >>>> [ 4.355612] pci_iommu_init+0x16/0x3f >>>> ... >>>> [ 4.670295] Unreclaimable slab info: >>>> ... >>>> [ 4.857565] kmalloc-2048 59164KB 59164KB >>>> >>>> Change IOVA_MAG_SIZE from 128 to 127 to make size of 'iova_magazine' >>>> 1024 bytes so that no memory will be wasted. >>>> >>>> [1]. https://lkml.org/lkml/2019/8/12/266 >>>> >>>> Signed-off-by: Feng Tang <feng.tang@intel.com> >>>> --- >>>> drivers/iommu/iova.c | 7 ++++++- >>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c >>>> index db77aa675145b..27634ddd9b904 100644 >>>> --- a/drivers/iommu/iova.c >>>> +++ b/drivers/iommu/iova.c >>>> @@ -614,7 +614,12 @@ EXPORT_SYMBOL_GPL(reserve_iova); >>>> * dynamic size tuning described in the paper. >>>> */ >>>> -#define IOVA_MAG_SIZE 128 >>>> +/* >>>> + * As kmalloc's buffer size is fixed to power of 2, 127 is chosen to >>>> + * assure size of 'iova_magzine' to be 1024 bytes, so that no memory >>> >>> Typo: iova_magazine >>> >>>> + * will be wasted. >>>> + */ >>>> +#define IOVA_MAG_SIZE 127 >> >> I do wonder if we will see some strange new behaviour since >> IOVA_FQ_SIZE % IOVA_MAG_SIZE != 0 now... > > I doubt it - even if a flush queue does happen to be entirely full of > equal-sized IOVAs, a CPU's loaded magazines also both being perfectly > empty when it comes to dump a full fq seem further unlikely, so in > practice I don't see this making any appreciable change to the > likelihood of spilling back to the depot or not. In fact the smaller the > magazines get, the less time would be spent flushing the depot back to > the rbtree, where your interesting workload falls off the cliff and > never catches back up with the fq timer, so at some point it might even > improve (unless it's also already close to the point where smaller > caches would bottleneck allocation)... might be interesting to > experiment with a wider range of magazine sizes if you had the time and > inclination. > ok, what you are saying sounds reasonable. I just remember that when we analyzed the longterm aging issue that we concluded that the FQ size and its relation to the magazine size was a factor and this change makes me a little worried about new issues. Better the devil you know and all that... Anyway, if I get some time I might do some testing to see if this change has any influence. Another thought is if we need even store the size in the iova_magazine? mags in the depot are always full. As such, we only need worry about mags loaded in the cpu rcache and their sizes, so maybe we could have something like this: struct iova_magazine { - unsigned long size; unsigned long pfns[IOVA_MAG_SIZE]; }; @@ -631,6 +630,8 @@ struct iova_cpu_rcache { spinlock_t lock; struct iova_magazine *loaded; struct iova_magazine *prev; + int loaded_size; + int prev_size; }; I haven't tried to implement it though.. Thanks, John
Hi John, On Thu, Jun 30, 2022 at 11:52:18AM +0100, John Garry wrote: > > > > > > > > > [ 4.319253] iommu: Adding device 0000:06:00.2 to group 5 > > > > > [ 4.325869] iommu: Adding device 0000:20:01.0 to group 15 > > > > > [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 > > > > > [ 4.338946] swapper/0 invoked oom-killer: > > > > > gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), > > > > > order=0, oom_score_adj=0 > > > > > [ 4.350251] swapper/0 cpuset=/ mems_allowed=0 > > > > > [ 4.354618] CPU: 0 PID: 1 Comm: swapper/0 Not > > > > > tainted 4.19.57.mx64.282 #1 > > > > > [ 4.355612] Hardware name: Dell Inc. PowerEdge > > > > > R7425/08V001, BIOS 1.9.3 06/25/2019 > > > > > [ 4.355612] Call Trace: > > > > > [ 4.355612] dump_stack+0x46/0x5b > > > > > [ 4.355612] dump_header+0x6b/0x289 > > > > > [ 4.355612] out_of_memory+0x470/0x4c0 > > > > > [ 4.355612] __alloc_pages_nodemask+0x970/0x1030 > > > > > [ 4.355612] cache_grow_begin+0x7d/0x520 > > > > > [ 4.355612] fallback_alloc+0x148/0x200 > > > > > [ 4.355612] kmem_cache_alloc_trace+0xac/0x1f0 > > > > > [ 4.355612] init_iova_domain+0x112/0x170 > > Note for Feng Tang: This callchain does not exist anymore since we separated > out the rcache init from the IOVA domain init. Indeed, not so much memory is > wasted for unused rcaches now. Thanks for the info, I also planned to remove the callstack as Robin suggested. > My point really is that it would be nicer to see a modern callchain - but > don't read that as me saying that the change is this patch is bad. > > > > > > [ 4.355612] amd_iommu_domain_alloc+0x138/0x1a0 > > > > > [ 4.355612] iommu_group_get_for_dev+0xc4/0x1a0 > > > > > [ 4.355612] amd_iommu_add_device+0x13a/0x610 > > > > > [ 4.355612] add_iommu_group+0x20/0x30 > > > > > [ 4.355612] bus_for_each_dev+0x76/0xc0 > > > > > [ 4.355612] bus_set_iommu+0xb6/0xf0 > > > > > [ 4.355612] amd_iommu_init_api+0x112/0x132 > > > > > [ 4.355612] state_next+0xfb1/0x1165 > > > > > [ 4.355612] amd_iommu_init+0x1f/0x67 > > > > > [ 4.355612] pci_iommu_init+0x16/0x3f > > > > > ... > > > > > [ 4.670295] Unreclaimable slab info: > > > > > ... > > > > > [ 4.857565] kmalloc-2048 59164KB 59164KB > > > > > > > > > > Change IOVA_MAG_SIZE from 128 to 127 to make size of 'iova_magazine' > > > > > 1024 bytes so that no memory will be wasted. > > > > > > > > > > [1]. https://lkml.org/lkml/2019/8/12/266 > > > > > > > > > > Signed-off-by: Feng Tang <feng.tang@intel.com> > > > > > --- > > > > > drivers/iommu/iova.c | 7 ++++++- > > > > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c > > > > > index db77aa675145b..27634ddd9b904 100644 > > > > > --- a/drivers/iommu/iova.c > > > > > +++ b/drivers/iommu/iova.c > > > > > @@ -614,7 +614,12 @@ EXPORT_SYMBOL_GPL(reserve_iova); > > > > > * dynamic size tuning described in the paper. > > > > > */ > > > > > -#define IOVA_MAG_SIZE 128 > > > > > +/* > > > > > + * As kmalloc's buffer size is fixed to power of 2, 127 is chosen to > > > > > + * assure size of 'iova_magzine' to be 1024 bytes, so that no memory > > > > > > > > Typo: iova_magazine > > > > > > > > > + * will be wasted. > > > > > + */ > > > > > +#define IOVA_MAG_SIZE 127 > > > > > > I do wonder if we will see some strange new behaviour since > > > IOVA_FQ_SIZE % IOVA_MAG_SIZE != 0 now... > > > > I doubt it - even if a flush queue does happen to be entirely full of > > equal-sized IOVAs, a CPU's loaded magazines also both being perfectly > > empty when it comes to dump a full fq seem further unlikely, so in > > practice I don't see this making any appreciable change to the > > likelihood of spilling back to the depot or not. In fact the smaller the > > magazines get, the less time would be spent flushing the depot back to > > the rbtree, where your interesting workload falls off the cliff and> > never catches back up with the fq timer, so at some point it might even > > improve (unless it's also already close to the point where smaller > > caches would bottleneck allocation)... might be interesting to > > experiment with a wider range of magazine sizes if you had the time and > > inclination. > > > > ok, what you are saying sounds reasonable. I just remember that when we > analyzed the longterm aging issue that we concluded that the FQ size and its > relation to the magazine size was a factor and this change makes me a little > worried about new issues. Better the devil you know and all that... > > Anyway, if I get some time I might do some testing to see if this change has > any influence. > > Another thought is if we need even store the size in the iova_magazine? mags > in the depot are always full. As such, we only need worry about mags loaded > in the cpu rcache and their sizes, so maybe we could have something like > this: > > struct iova_magazine { > - unsigned long size; > unsigned long pfns[IOVA_MAG_SIZE]; > }; > > @@ -631,6 +630,8 @@ struct iova_cpu_rcache { > spinlock_t lock; > struct iova_magazine *loaded; > struct iova_magazine *prev; > + int loaded_size; > + int prev_size; > }; > > I haven't tried to implement it though.. I have very few knowledge of iova, so you can chose what's the better solution. I just wanted to raise the problem and will be happy to see it solved :) Thanks, Feng > Thanks, > John
On 01/07/2022 04:56, Feng Tang wrote: >>> inclination. >>> >> ok, what you are saying sounds reasonable. I just remember that when we >> analyzed the longterm aging issue that we concluded that the FQ size and its >> relation to the magazine size was a factor and this change makes me a little >> worried about new issues. Better the devil you know and all that... >> >> Anyway, if I get some time I might do some testing to see if this change has >> any influence. >> >> Another thought is if we need even store the size in the iova_magazine? mags >> in the depot are always full. As such, we only need worry about mags loaded >> in the cpu rcache and their sizes, so maybe we could have something like >> this: >> >> struct iova_magazine { >> - unsigned long size; >> unsigned long pfns[IOVA_MAG_SIZE]; >> }; >> >> @@ -631,6 +630,8 @@ struct iova_cpu_rcache { >> spinlock_t lock; >> struct iova_magazine *loaded; >> struct iova_magazine *prev; >> + int loaded_size; >> + int prev_size; >> }; >> >> I haven't tried to implement it though.. > I have very few knowledge of iova, so you can chose what's the better > solution. I just wanted to raise the problem and will be happy to see > it solved:) I quickly tested your patch for performance and saw no noticeable difference, which is no surprise. But I'll defer to Robin if he thinks that your patch is a better solution - I would guess that he does. For me personally I would prefer that this value was not changed, as I mentioned before. thanks, John
On 2022-07-01 12:33, John Garry wrote: > On 01/07/2022 04:56, Feng Tang wrote: >>>> inclination. >>>> >>> ok, what you are saying sounds reasonable. I just remember that when we >>> analyzed the longterm aging issue that we concluded that the FQ size >>> and its >>> relation to the magazine size was a factor and this change makes me a >>> little >>> worried about new issues. Better the devil you know and all that... >>> >>> Anyway, if I get some time I might do some testing to see if this >>> change has >>> any influence. >>> >>> Another thought is if we need even store the size in the >>> iova_magazine? mags >>> in the depot are always full. As such, we only need worry about mags >>> loaded >>> in the cpu rcache and their sizes, so maybe we could have something like >>> this: >>> >>> struct iova_magazine { >>> - unsigned long size; >>> unsigned long pfns[IOVA_MAG_SIZE]; >>> }; >>> >>> @@ -631,6 +630,8 @@ struct iova_cpu_rcache { >>> spinlock_t lock; >>> struct iova_magazine *loaded; >>> struct iova_magazine *prev; >>> + int loaded_size; >>> + int prev_size; >>> }; >>> >>> I haven't tried to implement it though.. >> I have very few knowledge of iova, so you can chose what's the better >> solution. I just wanted to raise the problem and will be happy to see >> it solved:) > > I quickly tested your patch for performance and saw no noticeable > difference, which is no surprise. > > But I'll defer to Robin if he thinks that your patch is a better > solution - I would guess that he does. For me personally I would prefer > that this value was not changed, as I mentioned before. This idea is interesting, but it would mean a bit more fiddly work to keep things in sync when magazines are allocated, freed and swapped around. It seems like the kind of non-obvious thing that might make sense if it gave a significant improvement in cache locality or something like that, but for simply fixing an allocation size it feels a bit too wacky. From my perspective, indeed I'd rather do the simple thing for now to address the memory wastage issue directly, then we can do the deeper performance analysis on top to see if further tweaking of magazine sizes and/or design is justified. Cheers, Robin.
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index db77aa675145b..27634ddd9b904 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -614,7 +614,12 @@ EXPORT_SYMBOL_GPL(reserve_iova); * dynamic size tuning described in the paper. */ -#define IOVA_MAG_SIZE 128 +/* + * As kmalloc's buffer size is fixed to power of 2, 127 is chosen to + * assure size of 'iova_magzine' to be 1024 bytes, so that no memory + * will be wasted. + */ +#define IOVA_MAG_SIZE 127 #define MAX_GLOBAL_MAGS 32 /* magazines per bin */ struct iova_magazine {
kmalloc will round up the request size to power of 2, and current iova_magazine's size is 1032 (1024+8) bytes, so each instance allocated will get 2048 bytes from kmalloc, causing around 1KB waste. And in some exstreme case, the memory wasted can trigger OOM as reported in 2019 on a crash kernel with 256 MB memory [1]. [ 4.319253] iommu: Adding device 0000:06:00.2 to group 5 [ 4.325869] iommu: Adding device 0000:20:01.0 to group 15 [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 [ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0 [ 4.350251] swapper/0 cpuset=/ mems_allowed=0 [ 4.354618] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.57.mx64.282 #1 [ 4.355612] Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.9.3 06/25/2019 [ 4.355612] Call Trace: [ 4.355612] dump_stack+0x46/0x5b [ 4.355612] dump_header+0x6b/0x289 [ 4.355612] out_of_memory+0x470/0x4c0 [ 4.355612] __alloc_pages_nodemask+0x970/0x1030 [ 4.355612] cache_grow_begin+0x7d/0x520 [ 4.355612] fallback_alloc+0x148/0x200 [ 4.355612] kmem_cache_alloc_trace+0xac/0x1f0 [ 4.355612] init_iova_domain+0x112/0x170 [ 4.355612] amd_iommu_domain_alloc+0x138/0x1a0 [ 4.355612] iommu_group_get_for_dev+0xc4/0x1a0 [ 4.355612] amd_iommu_add_device+0x13a/0x610 [ 4.355612] add_iommu_group+0x20/0x30 [ 4.355612] bus_for_each_dev+0x76/0xc0 [ 4.355612] bus_set_iommu+0xb6/0xf0 [ 4.355612] amd_iommu_init_api+0x112/0x132 [ 4.355612] state_next+0xfb1/0x1165 [ 4.355612] amd_iommu_init+0x1f/0x67 [ 4.355612] pci_iommu_init+0x16/0x3f ... [ 4.670295] Unreclaimable slab info: ... [ 4.857565] kmalloc-2048 59164KB 59164KB Change IOVA_MAG_SIZE from 128 to 127 to make size of 'iova_magazine' 1024 bytes so that no memory will be wasted. [1]. https://lkml.org/lkml/2019/8/12/266 Signed-off-by: Feng Tang <feng.tang@intel.com> --- drivers/iommu/iova.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)