fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs

Message ID: ZJtBrybavtb1x45V@tpad
State: New, archived
Series: fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs

Commit Message

Marcelo Tosatti June 27, 2023, 8:08 p.m. UTC
For certain types of applications (for example PLC software or
RAN processing), upon occurrence of an event, it is necessary to
complete a certain task in a maximum amount of time (deadline).

One way to express this requirement is with a pair of numbers,
deadline time and execution time, where:

        * deadline time: length of time between event and deadline.
        * execution time: length of time it takes for the event to be
                          processed on a particular hardware platform
                          (uninterrupted).

The particular values depend on use-case. For the case
where the realtime application executes in a virtualized
guest, an IPI which must be serviced in the host will cause
the following sequence of events:

        1) VM-exit
        2) execution of IPI (and function call)
        3) VM-entry

This causes an excess of 50us of latency, as observed by cyclictest
(which violates the latency requirement of a vRAN application with
1ms TTI, for example).

invalidate_bh_lrus() sends an IPI to each CPU that has a non-empty
per-CPU buffer_head cache:

        on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
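
The two callbacks involved look roughly like this (condensed from
fs/buffer.c of this period; debug annotations are omitted and the
details vary by kernel version):

        static bool has_bh_in_lru(int cpu, void *dummy)
        {
                struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
                int i;

                for (i = 0; i < BH_LRU_SIZE; i++) {
                        if (b->bhs[i])
                                return true;
                }
                return false;
        }

        static void invalidate_bh_lru(void *arg)
        {
                struct bh_lru *b = &get_cpu_var(bh_lrus);
                int i;

                for (i = 0; i < BH_LRU_SIZE; i++) {
                        brelse(b->bhs[i]);      /* drop the LRU's ref */
                        b->bhs[i] = NULL;
                }
                put_cpu_var(bh_lrus);
        }

invalidate_bh_lru() runs in IPI context on every CPU that passes the
has_bh_in_lru() check, and that interruption is exactly what must be
kept away from isolated CPUs.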

The performance when using the per-CPU LRU cache is as follows:

 42 ns per __find_get_block
 68 ns per __find_get_block_slow

Given that the main use cases for latency-sensitive applications
do not involve block I/O (the data necessary for program operation
is locked in RAM), disable per-CPU buffer_head caches for isolated
CPUs.
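
Here, "isolated" means excluded from kernel housekeeping, i.e. CPUs
in the isolcpus= domain set or the nohz_full= tick-offload set. The
cpu_is_isolated() helper used by this patch is approximately the
following (a sketch of the 6.4-era definition, subject to change):

        static inline bool cpu_is_isolated(int cpu)
        {
                return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
                       !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
        }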

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Comments

Marcelo Tosatti July 26, 2023, 2:31 p.m. UTC | #1
Ping, apparently there is no objection to this patch...

Christian, what is the preferred tree for integration?

On Tue, Jun 27, 2023 at 05:08:15PM -0300, Marcelo Tosatti wrote:
> 
> For certain types of applications (for example PLC software or
> RAN processing), upon occurrence of an event, it is necessary to
> complete a certain task in a maximum amount of time (deadline).
> 
> One way to express this requirement is with a pair of numbers,
> deadline time and execution time, where:
> 
>         * deadline time: length of time between event and deadline.
>         * execution time: length of time it takes for the event to be
>                           processed on a particular hardware platform
>                           (uninterrupted).
> 
> The particular values depend on use-case. For the case
> where the realtime application executes in a virtualized
> guest, an IPI which must be serviced in the host will cause
> the following sequence of events:
> 
>         1) VM-exit
>         2) execution of IPI (and function call)
>         3) VM-entry
> 
> This causes an excess of 50us of latency, as observed by cyclictest
> (which violates the latency requirement of a vRAN application with
> 1ms TTI, for example).
> 
> invalidate_bh_lrus() sends an IPI to each CPU that has a non-empty
> per-CPU buffer_head cache:
> 
>         on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
> 
> The performance when using the per-CPU LRU cache is as follows:
> 
>  42 ns per __find_get_block
>  68 ns per __find_get_block_slow
> 
> Given that the main use cases for latency-sensitive applications
> do not involve block I/O (the data necessary for program operation
> is locked in RAM), disable per-CPU buffer_head caches for isolated
> CPUs.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index a7fc561758b1..49e9160ce100 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -49,6 +49,7 @@
>  #include <trace/events/block.h>
>  #include <linux/fscrypt.h>
>  #include <linux/fsverity.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "internal.h"
>  
> @@ -1289,7 +1290,7 @@ static void bh_lru_install(struct buffer_head *bh)
>  	 * failing page migration.
>  	 * Skip putting upcoming bh into bh_lru until migration is done.
>  	 */
> -	if (lru_cache_disabled()) {
> +	if (lru_cache_disabled() || cpu_is_isolated(smp_processor_id())) {
>  		bh_lru_unlock();
>  		return;
>  	}
> @@ -1319,6 +1320,10 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
>  
>  	check_irqs_on();
>  	bh_lru_lock();
> +	if (cpu_is_isolated(smp_processor_id())) {
> +		bh_lru_unlock();
> +		return NULL;
> +	}
>  	for (i = 0; i < BH_LRU_SIZE; i++) {
>  		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
>
Christian Brauner July 27, 2023, 9:18 a.m. UTC | #2
On Wed, Jul 26, 2023 at 11:31:26AM -0300, Marcelo Tosatti wrote:
> 
> Ping, apparently there is no objection to this patch...
> 
> Christian, what is the preferred tree for integration?

It'd be good if we could get an Ack from someone familiar with isolated
cpus for this; or just in general from someone who can ack this.
Marcelo Tosatti July 28, 2023, 7:35 p.m. UTC | #3
On Thu, Jul 27, 2023 at 11:18:11AM +0200, Christian Brauner wrote:
> On Wed, Jul 26, 2023 at 11:31:26AM -0300, Marcelo Tosatti wrote:
> > 
> > Ping, apparently there is no objection to this patch...
> > 
> > Christian, what is the preferred tree for integration?
> 
> It'd be good if we could get an Ack from someone familiar with isolated
> cpus for this; or just in general from someone who can ack this.

Frederic?
Frederic Weisbecker Aug. 4, 2023, 10:03 p.m. UTC | #4
On Tue, Jun 27, 2023 at 05:08:15PM -0300, Marcelo Tosatti wrote:
> 
> For certain types of applications (for example PLC software or
> RAN processing), upon occurrence of an event, it is necessary to
> complete a certain task in a maximum amount of time (deadline).
> 
> One way to express this requirement is with a pair of numbers,
> deadline time and execution time, where:
> 
>         * deadline time: length of time between event and deadline.
>         * execution time: length of time it takes for the event to be
>                           processed on a particular hardware platform
>                           (uninterrupted).
> 
> The particular values depend on use-case. For the case
> where the realtime application executes in a virtualized
> guest, an IPI which must be serviced in the host will cause
> the following sequence of events:
> 
>         1) VM-exit
>         2) execution of IPI (and function call)
>         3) VM-entry
> 
> This causes an excess of 50us of latency, as observed by cyclictest
> (which violates the latency requirement of a vRAN application with
> 1ms TTI, for example).
> 
> invalidate_bh_lrus() sends an IPI to each CPU that has a non-empty
> per-CPU buffer_head cache:
> 
>         on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
> 
> The performance when using the per-CPU LRU cache is as follows:
> 
>  42 ns per __find_get_block
>  68 ns per __find_get_block_slow
> 
> Given that the main use cases for latency-sensitive applications
> do not involve block I/O (the data necessary for program operation
> is locked in RAM), disable per-CPU buffer_head caches for isolated
> CPUs.

So what happens if they ever do I/O then? Like if they need to do
some prep work before entering an isolated critical section?

Thanks.

> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index a7fc561758b1..49e9160ce100 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -49,6 +49,7 @@
>  #include <trace/events/block.h>
>  #include <linux/fscrypt.h>
>  #include <linux/fsverity.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "internal.h"
>  
> @@ -1289,7 +1290,7 @@ static void bh_lru_install(struct buffer_head *bh)
>  	 * failing page migration.
>  	 * Skip putting upcoming bh into bh_lru until migration is done.
>  	 */
> -	if (lru_cache_disabled()) {
> +	if (lru_cache_disabled() || cpu_is_isolated(smp_processor_id())) {
>  		bh_lru_unlock();
>  		return;
>  	}
> @@ -1319,6 +1320,10 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
>  
>  	check_irqs_on();
>  	bh_lru_lock();
> +	if (cpu_is_isolated(smp_processor_id())) {
> +		bh_lru_unlock();
> +		return NULL;
> +	}
>  	for (i = 0; i < BH_LRU_SIZE; i++) {
>  		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
>  
>
Marcelo Tosatti Aug. 4, 2023, 11:54 p.m. UTC | #5
On Sat, Aug 05, 2023 at 12:03:59AM +0200, Frederic Weisbecker wrote:
> On Tue, Jun 27, 2023 at 05:08:15PM -0300, Marcelo Tosatti wrote:
> > 
> > For certain types of applications (for example PLC software or
> > RAN processing), upon occurrence of an event, it is necessary to
> > complete a certain task in a maximum amount of time (deadline).
> > 
> > One way to express this requirement is with a pair of numbers,
> > deadline time and execution time, where:
> > 
> >         * deadline time: length of time between event and deadline.
> >         * execution time: length of time it takes for the event to be
> >                           processed on a particular hardware platform
> >                           (uninterrupted).
> > 
> > The particular values depend on use-case. For the case
> > where the realtime application executes in a virtualized
> > guest, an IPI which must be serviced in the host will cause
> > the following sequence of events:
> > 
> >         1) VM-exit
> >         2) execution of IPI (and function call)
> >         3) VM-entry
> > 
> > This causes an excess of 50us of latency, as observed by cyclictest
> > (which violates the latency requirement of a vRAN application with
> > 1ms TTI, for example).
> > 
> > invalidate_bh_lrus() sends an IPI to each CPU that has a non-empty
> > per-CPU buffer_head cache:
> > 
> >         on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
> > 
> > The performance when using the per-CPU LRU cache is as follows:
> > 
> >  42 ns per __find_get_block
> >  68 ns per __find_get_block_slow
> > 
> > Given that the main use cases for latency-sensitive applications
> > do not involve block I/O (the data necessary for program operation
> > is locked in RAM), disable per-CPU buffer_head caches for isolated
> > CPUs.

Hi Frederic,

> So what happens if they ever do I/O then? Like if they need to do
> some prep work before entering an isolated critical section?

Then instead of going through the per-CPU LRU buffer_head cache
(__find_get_block), isolated CPUs will work as if their per-CPU
cache is always empty, going through the slowpath 
(__find_get_block_slow). The algorithm is:

/*
 * Perform a pagecache lookup for the matching buffer.  If it's there, refresh
 * it in the LRU and mark it as accessed.  If it is not present then return
 * NULL
 */
struct buffer_head *
__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
{
        struct buffer_head *bh = lookup_bh_lru(bdev, block, size);

        if (bh == NULL) {
                /* __find_get_block_slow will mark the page accessed */
                bh = __find_get_block_slow(bdev, block);
                if (bh)
                        bh_lru_install(bh);
        } else
                touch_buffer(bh);

        return bh;
}
EXPORT_SYMBOL(__find_get_block);
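
For reference, __find_get_block_slow() boils down to a pagecache
lookup on the block device's mapping plus a walk of the buffer_heads
attached to the resulting page. A condensed sketch (error paths and
debug checks omitted; not verbatim kernel code):

struct buffer_head *
__find_get_block_slow(struct block_device *bdev, sector_t block)
{
        struct address_space *bd_mapping = bdev->bd_inode->i_mapping;
        struct buffer_head *ret = NULL, *bh, *head;
        struct page *page;
        pgoff_t index;

        /* Which pagecache page of the bdev holds this block? */
        index = block >> (PAGE_SHIFT - bdev->bd_inode->i_blkbits);
        page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
        if (!page)
                return NULL;

        spin_lock(&bd_mapping->private_lock);
        if (page_has_buffers(page)) {
                bh = head = page_buffers(page);
                do {    /* circular list of buffers on this page */
                        if (buffer_mapped(bh) && bh->b_blocknr == block) {
                                get_bh(bh);
                                ret = bh;
                                break;
                        }
                        bh = bh->b_this_page;
                } while (bh != head);
        }
        spin_unlock(&bd_mapping->private_lock);
        put_page(page);
        return ret;
}

So the extra cost of the slow path is essentially the pagecache
(xarray) lookup plus the mapping->private_lock round trip, which
accounts for the 68ns vs. 42ns numbers quoted above.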

I think the performance difference between the per-CPU LRU cache
and __find_get_block_slow was much more significant when the cache
was introduced. Nowadays it's only 26ns (moreover, modern filesystems
do not use buffer_heads).

> Thanks.

Thank you for the review.

> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index a7fc561758b1..49e9160ce100 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -49,6 +49,7 @@
> >  #include <trace/events/block.h>
> >  #include <linux/fscrypt.h>
> >  #include <linux/fsverity.h>
> > +#include <linux/sched/isolation.h>
> >  
> >  #include "internal.h"
> >  
> > @@ -1289,7 +1290,7 @@ static void bh_lru_install(struct buffer_head *bh)
> >  	 * failing page migration.
> >  	 * Skip putting upcoming bh into bh_lru until migration is done.
> >  	 */
> > -	if (lru_cache_disabled()) {
> > +	if (lru_cache_disabled() || cpu_is_isolated(smp_processor_id())) {
> >  		bh_lru_unlock();
> >  		return;
> >  	}
> > @@ -1319,6 +1320,10 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
> >  
> >  	check_irqs_on();
> >  	bh_lru_lock();
> > +	if (cpu_is_isolated(smp_processor_id())) {
> > +		bh_lru_unlock();
> > +		return NULL;
> > +	}
> >  	for (i = 0; i < BH_LRU_SIZE; i++) {
> >  		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
> >  
> > 
> 
>
Frederic Weisbecker Aug. 10, 2023, 10:36 a.m. UTC | #6
On Fri, Aug 04, 2023 at 08:54:37PM -0300, Marcelo Tosatti wrote:
> > So what happens if they ever do I/O then? Like if they need to do
> > some prep work before entering an isolated critical section?
> 
> Then instead of going through the per-CPU LRU buffer_head cache
> (__find_get_block), isolated CPUs will work as if their per-CPU
> cache is always empty, going through the slowpath 
> (__find_get_block_slow). The algorithm is:
> 
> /*
>  * Perform a pagecache lookup for the matching buffer.  If it's there, refresh
>  * it in the LRU and mark it as accessed.  If it is not present then return
>  * NULL
>  */
> struct buffer_head *
> __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
> {
>         struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
> 
>         if (bh == NULL) {
>                 /* __find_get_block_slow will mark the page accessed */
>                 bh = __find_get_block_slow(bdev, block);
>                 if (bh)
>                         bh_lru_install(bh);
>         } else
>                 touch_buffer(bh);
> 
>         return bh;
> }
> EXPORT_SYMBOL(__find_get_block);
> 
> I think the performance difference between the per-CPU LRU cache
> and __find_get_block_slow was much more significant when the cache
> was introduced. Nowadays it's only 26ns (moreover, modern filesystems
> do not use buffer_heads).

Sounds good then!

Acked-by: Frederic Weisbecker <frederic@kernel.org>

Thanks!
Christian Brauner Aug. 10, 2023, 11:59 a.m. UTC | #7
On Tue, 27 Jun 2023 17:08:15 -0300, Marcelo Tosatti wrote:
> For certain types of applications (for example PLC software or
> RAN processing), upon occurrence of an event, it is necessary to
> complete a certain task in a maximum amount of time (deadline).
> 
> One way to express this requirement is with a pair of numbers,
> deadline time and execution time, where:
> 
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series, allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible, patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs
      https://git.kernel.org/vfs/vfs/c/9ed7cfdf38b8

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index a7fc561758b1..49e9160ce100 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -49,6 +49,7 @@ 
 #include <trace/events/block.h>
 #include <linux/fscrypt.h>
 #include <linux/fsverity.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
 
@@ -1289,7 +1290,7 @@ static void bh_lru_install(struct buffer_head *bh)
 	 * failing page migration.
 	 * Skip putting upcoming bh into bh_lru until migration is done.
 	 */
-	if (lru_cache_disabled()) {
+	if (lru_cache_disabled() || cpu_is_isolated(smp_processor_id())) {
 		bh_lru_unlock();
 		return;
 	}
@@ -1319,6 +1320,10 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
 
 	check_irqs_on();
 	bh_lru_lock();
+	if (cpu_is_isolated(smp_processor_id())) {
+		bh_lru_unlock();
+		return NULL;
+	}
 	for (i = 0; i < BH_LRU_SIZE; i++) {
 		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
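
Whether a CPU counts as isolated for the checks above is configured
at boot (isolcpus= / nohz_full=). On a running system, the resulting
CPU sets can be read back from /sys/devices/system/cpu/isolated and
/sys/devices/system/cpu/nohz_full. A trivial userspace check, for
illustration only (the in-kernel cpu_is_isolated() test is what
actually gates the cache):

        #include <stdio.h>

        int main(void)
        {
                char buf[256];
                FILE *f = fopen("/sys/devices/system/cpu/isolated", "r");

                if (f) {
                        if (fgets(buf, sizeof(buf), f))
                                printf("isolated cpus: %s", buf);
                        fclose(f);
                }
                return 0;
        }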