| Message ID | ZJtBrybavtb1x45V@tpad (mailing list archive) |
|---|---|
| State | New, archived |
| Series | fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs |
Ping, apparently there is no objection to this patch...

Christian, what is the preferred tree for integration?

On Tue, Jun 27, 2023 at 05:08:15PM -0300, Marcelo Tosatti wrote:
> 
> For certain types of applications (for example PLC software or
> RAN processing), upon occurrence of an event, it is necessary to
> complete a certain task in a maximum amount of time (deadline).
> 
> One way to express this requirement is with a pair of numbers,
> deadline time and execution time, where:
> 
> * deadline time: length of time between event and deadline.
> * execution time: length of time it takes for processing of event
>                   to occur on a particular hardware platform
>                   (uninterrupted).
> 
> The particular values depend on use-case. For the case
> where the realtime application executes in a virtualized
> guest, an IPI which must be serviced in the host will cause
> the following sequence of events:
> 
> 1) VM-exit
> 2) execution of IPI (and function call)
> 3) VM-entry
> 
> Which causes an excess of 50us latency as observed by cyclictest
> (this violates the latency requirement of vRAN application with 1ms TTI,
> for example).
> 
> invalidate_bh_lrus calls an IPI on each CPU that has non empty
> per-CPU cache:
> 
>     on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
> 
> The performance when using the per-CPU LRU cache is as follows:
> 
>     42 ns per __find_get_block
>     68 ns per __find_get_block_slow
> 
> Given that the main use cases for latency sensitive applications
> do not involve block I/O (data necessary for program operation is
> locked in RAM), disable per-CPU buffer_head caches for isolated CPUs.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index a7fc561758b1..49e9160ce100 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -49,6 +49,7 @@
>  #include <trace/events/block.h>
>  #include <linux/fscrypt.h>
>  #include <linux/fsverity.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "internal.h"
>  
> @@ -1289,7 +1290,7 @@ static void bh_lru_install(struct buffer_head *bh)
>  	 * failing page migration.
>  	 * Skip putting upcoming bh into bh_lru until migration is done.
>  	 */
> -	if (lru_cache_disabled()) {
> +	if (lru_cache_disabled() || cpu_is_isolated(smp_processor_id())) {
>  		bh_lru_unlock();
>  		return;
>  	}
> @@ -1319,6 +1320,10 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
>  
>  	check_irqs_on();
>  	bh_lru_lock();
> +	if (cpu_is_isolated(smp_processor_id())) {
> +		bh_lru_unlock();
> +		return NULL;
> +	}
>  	for (i = 0; i < BH_LRU_SIZE; i++) {
>  		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
On Wed, Jul 26, 2023 at 11:31:26AM -0300, Marcelo Tosatti wrote:
> 
> Ping, apparently there is no objection to this patch...
> 
> Christian, what is the preferred tree for integration?

It'd be good if we could get an Ack from someone familiar with isolated
cpus for this; or just in general from someone who can ack this.
On Thu, Jul 27, 2023 at 11:18:11AM +0200, Christian Brauner wrote:
> On Wed, Jul 26, 2023 at 11:31:26AM -0300, Marcelo Tosatti wrote:
> > 
> > Ping, apparently there is no objection to this patch...
> > 
> > Christian, what is the preferred tree for integration?
> 
> It'd be good if we could get an Ack from someone familiar with isolated
> cpus for this; or just in general from someone who can ack this.

Frederic?
On Tue, Jun 27, 2023 at 05:08:15PM -0300, Marcelo Tosatti wrote:
> 
> For certain types of applications (for example PLC software or
> RAN processing), upon occurrence of an event, it is necessary to
> complete a certain task in a maximum amount of time (deadline).
> 
> [...]
> 
> Given that the main use cases for latency sensitive applications
> do not involve block I/O (data necessary for program operation is
> locked in RAM), disable per-CPU buffer_head caches for isolated CPUs.

So what happens if they ever do I/O then? Like if they need to do
some prep work before entering an isolated critical section?

Thanks.
On Sat, Aug 05, 2023 at 12:03:59AM +0200, Frederic Weisbecker wrote:
> On Tue, Jun 27, 2023 at 05:08:15PM -0300, Marcelo Tosatti wrote:
> > 
> > For certain types of applications (for example PLC software or
> > RAN processing), upon occurrence of an event, it is necessary to
> > complete a certain task in a maximum amount of time (deadline).
> > 
> > [...]
> > 
> > Given that the main use cases for latency sensitive applications
> > do not involve block I/O (data necessary for program operation is
> > locked in RAM), disable per-CPU buffer_head caches for isolated CPUs.

Hi Frederic,

> So what happens if they ever do I/O then? Like if they need to do
> some prep work before entering an isolated critical section?
Then instead of going through the per-CPU LRU buffer_head cache
(__find_get_block), isolated CPUs will work as if their per-CPU
cache is always empty, going through the slowpath
(__find_get_block_slow). The algorithm is:

/*
 * Perform a pagecache lookup for the matching buffer.  If it's there, refresh
 * it in the LRU and mark it as accessed.  If it is not present then return
 * NULL
 */
struct buffer_head *
__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
{
	struct buffer_head *bh = lookup_bh_lru(bdev, block, size);

	if (bh == NULL) {
		/* __find_get_block_slow will mark the page accessed */
		bh = __find_get_block_slow(bdev, block);
		if (bh)
			bh_lru_install(bh);
	} else
		touch_buffer(bh);

	return bh;
}
EXPORT_SYMBOL(__find_get_block);

I think the performance difference between the per-CPU LRU cache
vs __find_get_block_slow was much more significant when the cache
was introduced. Nowadays it's only 26 ns (moreover, modern filesystems
do not use buffer_heads).

> Thanks.

Thank you for the review.
On Fri, Aug 04, 2023 at 08:54:37PM -0300, Marcelo Tosatti wrote:
> > So what happens if they ever do I/O then? Like if they need to do
> > some prep work before entering an isolated critical section?
> 
> Then instead of going through the per-CPU LRU buffer_head cache
> (__find_get_block), isolated CPUs will work as if their per-CPU
> cache is always empty, going through the slowpath
> (__find_get_block_slow).
> 
> [...]
> 
> I think the performance difference between the per-CPU LRU cache
> vs __find_get_block_slow was much more significant when the cache
> was introduced. Nowadays it's only 26 ns (moreover, modern filesystems
> do not use buffer_heads).

Sounds good then!

Acked-by: Frederic Weisbecker <frederic@kernel.org>

Thanks!
On Tue, 27 Jun 2023 17:08:15 -0300, Marcelo Tosatti wrote:
> For certain types of applications (for example PLC software or
> RAN processing), upon occurrence of an event, it is necessary to
> complete a certain task in a maximum amount of time (deadline).
> 
> One way to express this requirement is with a pair of numbers,
> deadline time and execution time, where:
> 
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review
in a new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though
the patch has now been applied. If possible patch trailers will be
updated.

Note that commit hashes shown below are subject to change due to
rebase, trailer updates or similar. If in doubt, please check the
listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs
      https://git.kernel.org/vfs/vfs/c/9ed7cfdf38b8
diff --git a/fs/buffer.c b/fs/buffer.c
index a7fc561758b1..49e9160ce100 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -49,6 +49,7 @@
 #include <trace/events/block.h>
 #include <linux/fscrypt.h>
 #include <linux/fsverity.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
 
@@ -1289,7 +1290,7 @@ static void bh_lru_install(struct buffer_head *bh)
 	 * failing page migration.
 	 * Skip putting upcoming bh into bh_lru until migration is done.
 	 */
-	if (lru_cache_disabled()) {
+	if (lru_cache_disabled() || cpu_is_isolated(smp_processor_id())) {
 		bh_lru_unlock();
 		return;
 	}
@@ -1319,6 +1320,10 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
 
 	check_irqs_on();
 	bh_lru_lock();
+	if (cpu_is_isolated(smp_processor_id())) {
+		bh_lru_unlock();
+		return NULL;
+	}
 	for (i = 0; i < BH_LRU_SIZE; i++) {
 		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
For certain types of applications (for example PLC software or
RAN processing), upon occurrence of an event, it is necessary to
complete a certain task in a maximum amount of time (deadline).

One way to express this requirement is with a pair of numbers,
deadline time and execution time, where:

* deadline time: length of time between event and deadline.
* execution time: length of time it takes for processing of event
                  to occur on a particular hardware platform
                  (uninterrupted).

The particular values depend on use-case. For the case
where the realtime application executes in a virtualized
guest, an IPI which must be serviced in the host will cause
the following sequence of events:

1) VM-exit
2) execution of IPI (and function call)
3) VM-entry

Which causes an excess of 50us latency as observed by cyclictest
(this violates the latency requirement of vRAN application with 1ms TTI,
for example).

invalidate_bh_lrus calls an IPI on each CPU that has non empty
per-CPU cache:

    on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);

The performance when using the per-CPU LRU cache is as follows:

    42 ns per __find_get_block
    68 ns per __find_get_block_slow

Given that the main use cases for latency sensitive applications
do not involve block I/O (data necessary for program operation is
locked in RAM), disable per-CPU buffer_head caches for isolated CPUs.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>