diff mbox series

[PATCHv5,1/1] block: introduce content activity based ioprio

Message ID 20240130084207.3760518-1-zhaoyang.huang@unisoc.com (mailing list archive)
State New, archived
Headers show
Series [PATCHv5,1/1] block: introduce content activity based ioprio | expand

Commit Message

zhaoyang.huang Jan. 30, 2024, 8:42 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

Currently, request's ioprio are set via task's schedule priority(when no
blkcg configured), which has high priority tasks possess the privilege on
both of CPU and IO scheduling.
This commit works as a hint of original policy by promoting the request ioprio
based on the page/folio's activity. The original idea comes from LRU_GEN
which provides more precised folio activity than before. This commit try
to adjust the request's ioprio when certain part of its folios are hot,
which indicate that this request carry important contents and need be
scheduled ealier.

This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
by changing the bio_add_page/folio API in erofs, ext4 and f2fs in
another commit.

Case 1:
script[a] which get significant improved fault time as expected[b]
where dd's cost also shrink from 55s to 40s.
(1). fault_latency.bin is an ebpf based test tool which measure all task's
   iowait latency during page fault when scheduled out/in.
(2). costmem generate page fault by mmaping a file and access the VA.
(3). dd generate concurrent vfs io.

[a]
./fault_latency.bin 1 5 > /data/dd_costmem &
costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
dd if=/dev/block/sda of=/data/ddtest bs=1024 count=2048000 &
dd if=/dev/block/sda of=/data/ddtest1 bs=1024 count=2048000 &
dd if=/dev/block/sda of=/data/ddtest2 bs=1024 count=2048000 &
dd if=/dev/block/sda of=/data/ddtest3 bs=1024 count=2048000
[b]
                       mainline		commit
io wait                836us            156us

Case 2:
fio -filename=/dev/block/by-name/userdata -rw=randread -direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest
mainline: 513MiB/s
READ: bw=531MiB/s (557MB/s), 531MiB/s-531MiB/s (557MB/s-557MB/s), io=15.6GiB (16.8GB), run=30137-30137msec
READ: bw=543MiB/s (569MB/s), 543MiB/s-543MiB/s (569MB/s-569MB/s), io=15.6GiB (16.8GB), run=29469-29469msec
READ: bw=474MiB/s (497MB/s), 474MiB/s-474MiB/s (497MB/s-497MB/s), io=15.6GiB (16.8GB), run=33724-33724msec
READ: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=15.6GiB (16.8GB), run=29928-29928msec
READ: bw=523MiB/s (548MB/s), 523MiB/s-523MiB/s (548MB/s-548MB/s), io=15.6GiB (16.8GB), run=30617-30617msec
READ: bw=492MiB/s (516MB/s), 492MiB/s-492MiB/s (516MB/s-516MB/s), io=15.6GiB (16.8GB), run=32518-32518msec
READ: bw=533MiB/s (559MB/s), 533MiB/s-533MiB/s (559MB/s-559MB/s), io=15.6GiB (16.8GB), run=29993-29993msec
READ: bw=524MiB/s (550MB/s), 524MiB/s-524MiB/s (550MB/s-550MB/s), io=15.6GiB (16.8GB), run=30526-30526msec
READ: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=15.6GiB (16.8GB), run=30269-30269msec
READ: bw=449MiB/s (471MB/s), 449MiB/s-449MiB/s (471MB/s-471MB/s), io=15.6GiB (16.8GB), run=35629-35629msec

commit: 633MiB/s
READ: bw=668MiB/s (700MB/s), 668MiB/s-668MiB/s (700MB/s-700MB/s), io=15.6GiB (16.8GB), run=23952-23952msec
READ: bw=589MiB/s (618MB/s), 589MiB/s-589MiB/s (618MB/s-618MB/s), io=15.6GiB (16.8GB), run=27164-27164msec
READ: bw=638MiB/s (669MB/s), 638MiB/s-638MiB/s (669MB/s-669MB/s), io=15.6GiB (16.8GB), run=25071-25071msec
READ: bw=714MiB/s (749MB/s), 714MiB/s-714MiB/s (749MB/s-749MB/s), io=15.6GiB (16.8GB), run=22409-22409msec
READ: bw=600MiB/s (629MB/s), 600MiB/s-600MiB/s (629MB/s-629MB/s), io=15.6GiB (16.8GB), run=26669-26669msec
READ: bw=592MiB/s (621MB/s), 592MiB/s-592MiB/s (621MB/s-621MB/s), io=15.6GiB (16.8GB), run=27036-27036msec
READ: bw=691MiB/s (725MB/s), 691MiB/s-691MiB/s (725MB/s-725MB/s), io=15.6GiB (16.8GB), run=23150-23150msec
READ: bw=569MiB/s (596MB/s), 569MiB/s-569MiB/s (596MB/s-596MB/s), io=15.6GiB (16.8GB), run=28142-28142msec
READ: bw=563MiB/s (590MB/s), 563MiB/s-563MiB/s (590MB/s-590MB/s), io=15.6GiB (16.8GB), run=28429-28429msec
READ: bw=712MiB/s (746MB/s), 712MiB/s-712MiB/s (746MB/s-746MB/s), io=15.6GiB (16.8GB), run=22478-22478msec

Case 3:
This commit is also verified by the case of launching camera APP which is
usually considered as heavy working load on both of memory and IO, which
shows 12%-24% improvement.

		ttl = 0		ttl = 50	ttl = 100
mainline        2267ms		2420ms		2316ms
commit          1992ms          1806ms          1998ms

case 4:
androbench has no improvment as well as regression which supposed to be
its test time is short which MGLRU hasn't take effect yet.

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
change of v2: calculate page's activity via helper function
change of v3: solve layer violation by move API into mm
change of v4: keep block clean by removing the page related API
change of v5: introduce the macros of bio_add_folio/page for read dir.
---
---
 include/linux/act_ioprio.h  | 60 +++++++++++++++++++++++++++++++++++++
 include/uapi/linux/ioprio.h | 38 +++++++++++++++++++++++
 mm/Kconfig                  |  8 +++++
 3 files changed, 106 insertions(+)
 create mode 100644 include/linux/act_ioprio.h

Comments

Damien Le Moal Jan. 30, 2024, 9:17 a.m. UTC | #1
On 1/30/24 17:42, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> Currently, request's ioprio are set via task's schedule priority(when no
> blkcg configured), which has high priority tasks possess the privilege on
> both of CPU and IO scheduling.
> This commit works as a hint of original policy by promoting the request ioprio
> based on the page/folio's activity. The original idea comes from LRU_GEN
> which provides more precised folio activity than before. This commit try
> to adjust the request's ioprio when certain part of its folios are hot,
> which indicate that this request carry important contents and need be
> scheduled ealier.
> 
> This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
> by changing the bio_add_page/folio API in erofs, ext4 and f2fs in
> another commit.
> 
> Case 1:
> script[a] which get significant improved fault time as expected[b]
> where dd's cost also shrink from 55s to 40s.
> (1). fault_latency.bin is an ebpf based test tool which measure all task's
>    iowait latency during page fault when scheduled out/in.
> (2). costmem generate page fault by mmaping a file and access the VA.
> (3). dd generate concurrent vfs io.
> 
> [a]
> ./fault_latency.bin 1 5 > /data/dd_costmem &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> dd if=/dev/block/sda of=/data/ddtest bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest1 bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest2 bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest3 bs=1024 count=2048000
> [b]
>                        mainline		commit
> io wait                836us            156us
> 
> Case 2:
> fio -filename=/dev/block/by-name/userdata -rw=randread -direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest
> mainline: 513MiB/s
> READ: bw=531MiB/s (557MB/s), 531MiB/s-531MiB/s (557MB/s-557MB/s), io=15.6GiB (16.8GB), run=30137-30137msec
> READ: bw=543MiB/s (569MB/s), 543MiB/s-543MiB/s (569MB/s-569MB/s), io=15.6GiB (16.8GB), run=29469-29469msec
> READ: bw=474MiB/s (497MB/s), 474MiB/s-474MiB/s (497MB/s-497MB/s), io=15.6GiB (16.8GB), run=33724-33724msec
> READ: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=15.6GiB (16.8GB), run=29928-29928msec
> READ: bw=523MiB/s (548MB/s), 523MiB/s-523MiB/s (548MB/s-548MB/s), io=15.6GiB (16.8GB), run=30617-30617msec
> READ: bw=492MiB/s (516MB/s), 492MiB/s-492MiB/s (516MB/s-516MB/s), io=15.6GiB (16.8GB), run=32518-32518msec
> READ: bw=533MiB/s (559MB/s), 533MiB/s-533MiB/s (559MB/s-559MB/s), io=15.6GiB (16.8GB), run=29993-29993msec
> READ: bw=524MiB/s (550MB/s), 524MiB/s-524MiB/s (550MB/s-550MB/s), io=15.6GiB (16.8GB), run=30526-30526msec
> READ: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=15.6GiB (16.8GB), run=30269-30269msec
> READ: bw=449MiB/s (471MB/s), 449MiB/s-449MiB/s (471MB/s-471MB/s), io=15.6GiB (16.8GB), run=35629-35629msec
> 
> commit: 633MiB/s
> READ: bw=668MiB/s (700MB/s), 668MiB/s-668MiB/s (700MB/s-700MB/s), io=15.6GiB (16.8GB), run=23952-23952msec
> READ: bw=589MiB/s (618MB/s), 589MiB/s-589MiB/s (618MB/s-618MB/s), io=15.6GiB (16.8GB), run=27164-27164msec
> READ: bw=638MiB/s (669MB/s), 638MiB/s-638MiB/s (669MB/s-669MB/s), io=15.6GiB (16.8GB), run=25071-25071msec
> READ: bw=714MiB/s (749MB/s), 714MiB/s-714MiB/s (749MB/s-749MB/s), io=15.6GiB (16.8GB), run=22409-22409msec
> READ: bw=600MiB/s (629MB/s), 600MiB/s-600MiB/s (629MB/s-629MB/s), io=15.6GiB (16.8GB), run=26669-26669msec
> READ: bw=592MiB/s (621MB/s), 592MiB/s-592MiB/s (621MB/s-621MB/s), io=15.6GiB (16.8GB), run=27036-27036msec
> READ: bw=691MiB/s (725MB/s), 691MiB/s-691MiB/s (725MB/s-725MB/s), io=15.6GiB (16.8GB), run=23150-23150msec
> READ: bw=569MiB/s (596MB/s), 569MiB/s-569MiB/s (596MB/s-596MB/s), io=15.6GiB (16.8GB), run=28142-28142msec
> READ: bw=563MiB/s (590MB/s), 563MiB/s-563MiB/s (590MB/s-590MB/s), io=15.6GiB (16.8GB), run=28429-28429msec
> READ: bw=712MiB/s (746MB/s), 712MiB/s-712MiB/s (746MB/s-746MB/s), io=15.6GiB (16.8GB), run=22478-22478msec
> 
> Case 3:
> This commit is also verified by the case of launching camera APP which is
> usually considered as heavy working load on both of memory and IO, which
> shows 12%-24% improvement.
> 
> 		ttl = 0		ttl = 50	ttl = 100
> mainline        2267ms		2420ms		2316ms
> commit          1992ms          1806ms          1998ms
> 
> case 4:
> androbench has no improvment as well as regression which supposed to be
> its test time is short which MGLRU hasn't take effect yet.
> 
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
> change of v2: calculate page's activity via helper function
> change of v3: solve layer violation by move API into mm
> change of v4: keep block clean by removing the page related API
> change of v5: introduce the macros of bio_add_folio/page for read dir.
> ---
> ---
>  include/linux/act_ioprio.h  | 60 +++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/ioprio.h | 38 +++++++++++++++++++++++
>  mm/Kconfig                  |  8 +++++
>  3 files changed, 106 insertions(+)
>  create mode 100644 include/linux/act_ioprio.h
> 
> diff --git a/include/linux/act_ioprio.h b/include/linux/act_ioprio.h
> new file mode 100644
> index 000000000000..ca7309b85758
> --- /dev/null
> +++ b/include/linux/act_ioprio.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _ACT_IOPRIO_H
> +#define _ACT_IOPRIO_H
> +
> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> +#include <linux/bio.h>
> +
> +static __maybe_unused
> +bool act_bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
> +		size_t off)
> +{
> +	int class, level, hint, activity;
> +	bool ret;
> +
> +	ret = bio_add_folio(bio, folio, len, off);
> +	if (bio_op(bio) == REQ_OP_READ && ret) {
> +		class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> +		level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> +		hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> +		activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> +		activity += (activity < IOPRIO_NR_ACTIVITY &&
> +				folio_test_workingset(folio)) ? 1 : 0;
> +		if (activity >= bio->bi_vcnt / 2)
> +			class = IOPRIO_CLASS_RT;
> +		else if (activity >= bio->bi_vcnt / 4)
> +			class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> +		activity = min(IOPRIO_NR_ACTIVITY - 1, activity);
> +		bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> +	}
> +	return ret;
> +}

Big non-inline functions in a header file... That is unusual, to say the least.
So every FS that includes this will get its own copy of the binary for these
functions. That is not exactly optimal.

> +
> +static __maybe_unused
> +int act_bio_add_page(struct bio *bio, struct page *page,
> +		unsigned int len, unsigned int offset)
> +{
> +	int class, level, hint, activity;
> +	int ret = 0;
> +
> +	ret = bio_add_page(bio, page, len, offset);
> +	if (bio_op(bio) == REQ_OP_READ && ret > 0) {
> +		class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> +		level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> +		hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> +		activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> +		activity += (activity < IOPRIO_NR_ACTIVITY &&
> +				PageWorkingset(page)) ? 1 : 0;
> +		if (activity >= bio->bi_vcnt / 2)
> +			class = IOPRIO_CLASS_RT;
> +		else if (activity >= bio->bi_vcnt / 4)
> +			class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> +		activity = min(IOPRIO_NR_ACTIVITY - 1, activity);
> +		bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> +	}
> +	return ret;
> +}
> +#define bio_add_folio(bio, folio, len, off)     act_bio_add_folio(bio, folio, len, off)
> +#define bio_add_page(bio, page, len, offset)    act_bio_add_page(bio, page, len, offset)

These functions are *NOT* part of the block layer. So please do not pretend they
are. Why don't you simply write a function equivalent to what you have inside
the "if" above and have the FS call that after bio_add_Page() ?

And I seriously doubt that all compilers will be happy with these macro names
clashing with real function names...

> +#endif
> +#endif
> diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
> index bee2bdb0eedb..64cf5ff0ac5f 100644
> --- a/include/uapi/linux/ioprio.h
> +++ b/include/uapi/linux/ioprio.h
> @@ -71,12 +71,24 @@ enum {
>   * class and level.
>   */
>  #define IOPRIO_HINT_SHIFT		IOPRIO_LEVEL_NR_BITS
> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> +#define IOPRIO_HINT_NR_BITS		3
> +#else
>  #define IOPRIO_HINT_NR_BITS		10
> +#endif
>  #define IOPRIO_NR_HINTS			(1 << IOPRIO_HINT_NR_BITS)
>  #define IOPRIO_HINT_MASK		(IOPRIO_NR_HINTS - 1)
>  #define IOPRIO_PRIO_HINT(ioprio)	\
>  	(((ioprio) >> IOPRIO_HINT_SHIFT) & IOPRIO_HINT_MASK)
>  
> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> +#define IOPRIO_ACTIVITY_SHIFT		(IOPRIO_HINT_NR_BITS + IOPRIO_LEVEL_NR_BITS)
> +#define IOPRIO_ACTIVITY_NR_BITS		7

I already told you that taking all the free hint bits for yourself, leaving no
room fo future IO hints, is not nice. Do you really need 7 bits for your thing ?
Why does the activity even need to be part of the IO priority ? From the rather
short explanation in the commit message, it seems that activity should simply
raise the priority (either class or level or both). I do not see why that
activity number needs to be in the ioprio. Who in the kernel will look at it ?
IO scheduler ? the storage device ?

> +#define IOPRIO_NR_ACTIVITY		(1 << IOPRIO_ACTIVITY_NR_BITS)
> +#define IOPRIO_ACTIVITY_MASK		(IOPRIO_NR_ACTIVITY - 1)
> +#define IOPRIO_PRIO_ACTIVITY(ioprio)	\
> +	(((ioprio) >> IOPRIO_ACTIVITY_SHIFT) & IOPRIO_ACTIVITY_MASK)
> +#endif
>  /*
>   * I/O hints.
>   */
> @@ -104,6 +116,7 @@ enum {
>  
>  #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))
>  
> +#ifndef CONFIG_CONTENT_ACT_BASED_IOPRIO
>  /*
>   * Return an I/O priority value based on a class, a level and a hint.
>   */
> @@ -123,5 +136,30 @@ static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
>  	ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE)
>  #define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint)	\
>  	ioprio_value(prioclass, priolevel, priohint)
> +#else
> +/*
> + * Return an I/O priority value based on a class, a level, a hint and
> + * content's activities
> + */
> +static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
> +		int priohint, int activity)
> +{
> +	if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
> +			IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
> +			IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS) ||
> +			IOPRIO_BAD_VALUE(activity, IOPRIO_NR_ACTIVITY))
> +		return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
>  
> +	return (prioclass << IOPRIO_CLASS_SHIFT) |
> +		(activity << IOPRIO_ACTIVITY_SHIFT) |
> +		(priohint << IOPRIO_HINT_SHIFT) | priolevel;
> +}
> +
> +#define IOPRIO_PRIO_VALUE(prioclass, priolevel)			\
> +	ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE, 0)
> +#define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint)	\
> +	ioprio_value(prioclass, priolevel, priohint, 0)
> +#define IOPRIO_PRIO_VALUE_ACTIVITY(prioclass, priolevel, priohint, activity)	\
> +	ioprio_value(prioclass, priolevel, priohint, activity)
> +#endif
>  #endif /* _UAPI_LINUX_IOPRIO_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 264a2df5ecf5..e0e5a5a44ded 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1240,6 +1240,14 @@ config LRU_GEN_STATS
>  	  from evicted generations for debugging purpose.
>  
>  	  This option has a per-memcg and per-node memory overhead.
> +
> +config CONTENT_ACT_BASED_IOPRIO
> +	bool "Enable content activity based ioprio"
> +	depends on LRU_GEN
> +	default n
> +	help
> +	  This item enable the feature of adjust bio's priority by
> +	  calculating its content's activity.
>  # }
>  
>  config ARCH_SUPPORTS_PER_VMA_LOCK
Zhaoyang Huang Jan. 30, 2024, 12:43 p.m. UTC | #2
On Tue, Jan 30, 2024 at 5:17 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 1/30/24 17:42, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > Currently, request's ioprio are set via task's schedule priority(when no
> > blkcg configured), which has high priority tasks possess the privilege on
> > both of CPU and IO scheduling.
> > This commit works as a hint of original policy by promoting the request ioprio
> > based on the page/folio's activity. The original idea comes from LRU_GEN
> > which provides more precised folio activity than before. This commit try
> > to adjust the request's ioprio when certain part of its folios are hot,
> > which indicate that this request carry important contents and need be
> > scheduled ealier.
> >
> > This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
> > by changing the bio_add_page/folio API in erofs, ext4 and f2fs in
> > another commit.
> >
> > Case 1:
> > script[a] which get significant improved fault time as expected[b]
> > where dd's cost also shrink from 55s to 40s.
> > (1). fault_latency.bin is an ebpf based test tool which measure all task's
> >    iowait latency during page fault when scheduled out/in.
> > (2). costmem generate page fault by mmaping a file and access the VA.
> > (3). dd generate concurrent vfs io.
> >
> > [a]
> > ./fault_latency.bin 1 5 > /data/dd_costmem &
> > costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> > costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> > costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> > costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> > dd if=/dev/block/sda of=/data/ddtest bs=1024 count=2048000 &
> > dd if=/dev/block/sda of=/data/ddtest1 bs=1024 count=2048000 &
> > dd if=/dev/block/sda of=/data/ddtest2 bs=1024 count=2048000 &
> > dd if=/dev/block/sda of=/data/ddtest3 bs=1024 count=2048000
> > [b]
> >                        mainline               commit
> > io wait                836us            156us
> >
> > Case 2:
> > fio -filename=/dev/block/by-name/userdata -rw=randread -direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest
> > mainline: 513MiB/s
> > READ: bw=531MiB/s (557MB/s), 531MiB/s-531MiB/s (557MB/s-557MB/s), io=15.6GiB (16.8GB), run=30137-30137msec
> > READ: bw=543MiB/s (569MB/s), 543MiB/s-543MiB/s (569MB/s-569MB/s), io=15.6GiB (16.8GB), run=29469-29469msec
> > READ: bw=474MiB/s (497MB/s), 474MiB/s-474MiB/s (497MB/s-497MB/s), io=15.6GiB (16.8GB), run=33724-33724msec
> > READ: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=15.6GiB (16.8GB), run=29928-29928msec
> > READ: bw=523MiB/s (548MB/s), 523MiB/s-523MiB/s (548MB/s-548MB/s), io=15.6GiB (16.8GB), run=30617-30617msec
> > READ: bw=492MiB/s (516MB/s), 492MiB/s-492MiB/s (516MB/s-516MB/s), io=15.6GiB (16.8GB), run=32518-32518msec
> > READ: bw=533MiB/s (559MB/s), 533MiB/s-533MiB/s (559MB/s-559MB/s), io=15.6GiB (16.8GB), run=29993-29993msec
> > READ: bw=524MiB/s (550MB/s), 524MiB/s-524MiB/s (550MB/s-550MB/s), io=15.6GiB (16.8GB), run=30526-30526msec
> > READ: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=15.6GiB (16.8GB), run=30269-30269msec
> > READ: bw=449MiB/s (471MB/s), 449MiB/s-449MiB/s (471MB/s-471MB/s), io=15.6GiB (16.8GB), run=35629-35629msec
> >
> > commit: 633MiB/s
> > READ: bw=668MiB/s (700MB/s), 668MiB/s-668MiB/s (700MB/s-700MB/s), io=15.6GiB (16.8GB), run=23952-23952msec
> > READ: bw=589MiB/s (618MB/s), 589MiB/s-589MiB/s (618MB/s-618MB/s), io=15.6GiB (16.8GB), run=27164-27164msec
> > READ: bw=638MiB/s (669MB/s), 638MiB/s-638MiB/s (669MB/s-669MB/s), io=15.6GiB (16.8GB), run=25071-25071msec
> > READ: bw=714MiB/s (749MB/s), 714MiB/s-714MiB/s (749MB/s-749MB/s), io=15.6GiB (16.8GB), run=22409-22409msec
> > READ: bw=600MiB/s (629MB/s), 600MiB/s-600MiB/s (629MB/s-629MB/s), io=15.6GiB (16.8GB), run=26669-26669msec
> > READ: bw=592MiB/s (621MB/s), 592MiB/s-592MiB/s (621MB/s-621MB/s), io=15.6GiB (16.8GB), run=27036-27036msec
> > READ: bw=691MiB/s (725MB/s), 691MiB/s-691MiB/s (725MB/s-725MB/s), io=15.6GiB (16.8GB), run=23150-23150msec
> > READ: bw=569MiB/s (596MB/s), 569MiB/s-569MiB/s (596MB/s-596MB/s), io=15.6GiB (16.8GB), run=28142-28142msec
> > READ: bw=563MiB/s (590MB/s), 563MiB/s-563MiB/s (590MB/s-590MB/s), io=15.6GiB (16.8GB), run=28429-28429msec
> > READ: bw=712MiB/s (746MB/s), 712MiB/s-712MiB/s (746MB/s-746MB/s), io=15.6GiB (16.8GB), run=22478-22478msec
> >
> > Case 3:
> > This commit is also verified by the case of launching camera APP which is
> > usually considered as heavy working load on both of memory and IO, which
> > shows 12%-24% improvement.
> >
> >               ttl = 0         ttl = 50        ttl = 100
> > mainline        2267ms                2420ms          2316ms
> > commit          1992ms          1806ms          1998ms
> >
> > case 4:
> > androbench has no improvment as well as regression which supposed to be
> > its test time is short which MGLRU hasn't take effect yet.
> >
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > ---
> > change of v2: calculate page's activity via helper function
> > change of v3: solve layer violation by move API into mm
> > change of v4: keep block clean by removing the page related API
> > change of v5: introduce the macros of bio_add_folio/page for read dir.
> > ---
> > ---
> >  include/linux/act_ioprio.h  | 60 +++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/ioprio.h | 38 +++++++++++++++++++++++
> >  mm/Kconfig                  |  8 +++++
> >  3 files changed, 106 insertions(+)
> >  create mode 100644 include/linux/act_ioprio.h
> >
> > diff --git a/include/linux/act_ioprio.h b/include/linux/act_ioprio.h
> > new file mode 100644
> > index 000000000000..ca7309b85758
> > --- /dev/null
> > +++ b/include/linux/act_ioprio.h
> > @@ -0,0 +1,60 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _ACT_IOPRIO_H
> > +#define _ACT_IOPRIO_H
> > +
> > +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> > +#include <linux/bio.h>
> > +
> > +static __maybe_unused
> > +bool act_bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
> > +             size_t off)
> > +{
> > +     int class, level, hint, activity;
> > +     bool ret;
> > +
> > +     ret = bio_add_folio(bio, folio, len, off);
> > +     if (bio_op(bio) == REQ_OP_READ && ret) {
> > +             class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> > +             level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> > +             hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> > +             activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> > +             activity += (activity < IOPRIO_NR_ACTIVITY &&
> > +                             folio_test_workingset(folio)) ? 1 : 0;
> > +             if (activity >= bio->bi_vcnt / 2)
> > +                     class = IOPRIO_CLASS_RT;
> > +             else if (activity >= bio->bi_vcnt / 4)
> > +                     class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> > +             activity = min(IOPRIO_NR_ACTIVITY - 1, activity);
> > +             bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> > +     }
> > +     return ret;
> > +}
>
> Big non-inline functions in a header file... That is unusual, to say the least.
> So every FS that includes this will get its own copy of the binary for these
> functions. That is not exactly optimal.
Thanks for quick reply:D
This is a trade-off method for having both the block layer and fs be
clean and do no modification. There is less calling bio_add_xxx within
fs actually.
>
> > +
> > +static __maybe_unused
> > +int act_bio_add_page(struct bio *bio, struct page *page,
> > +             unsigned int len, unsigned int offset)
> > +{
> > +     int class, level, hint, activity;
> > +     int ret = 0;
> > +
> > +     ret = bio_add_page(bio, page, len, offset);
> > +     if (bio_op(bio) == REQ_OP_READ && ret > 0) {
> > +             class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> > +             level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> > +             hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> > +             activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> > +             activity += (activity < IOPRIO_NR_ACTIVITY &&
> > +                             PageWorkingset(page)) ? 1 : 0;
> > +             if (activity >= bio->bi_vcnt / 2)
> > +                     class = IOPRIO_CLASS_RT;
> > +             else if (activity >= bio->bi_vcnt / 4)
> > +                     class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> > +             activity = min(IOPRIO_NR_ACTIVITY - 1, activity);
> > +             bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> > +     }
> > +     return ret;
> > +}
> > +#define bio_add_folio(bio, folio, len, off)     act_bio_add_folio(bio, folio, len, off)
> > +#define bio_add_page(bio, page, len, offset)    act_bio_add_page(bio, page, len, offset)
>
> These functions are *NOT* part of the block layer. So please do not pretend they
> are. Why don't you simply write a function equivalent to what you have inside
> the "if" above and have the FS call that after bio_add_Page() ?
The iteration of bio is costly(could be maximum to 256 pages) and
needs fs's code modification. I will implement a version as you
suggested.
>
> And I seriously doubt that all compilers will be happy with these macro names
> clashing with real function names...
>
> > +#endif
> > +#endif
> > diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
> > index bee2bdb0eedb..64cf5ff0ac5f 100644
> > --- a/include/uapi/linux/ioprio.h
> > +++ b/include/uapi/linux/ioprio.h
> > @@ -71,12 +71,24 @@ enum {
> >   * class and level.
> >   */
> >  #define IOPRIO_HINT_SHIFT            IOPRIO_LEVEL_NR_BITS
> > +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> > +#define IOPRIO_HINT_NR_BITS          3
> > +#else
> >  #define IOPRIO_HINT_NR_BITS          10
> > +#endif
> >  #define IOPRIO_NR_HINTS                      (1 << IOPRIO_HINT_NR_BITS)
> >  #define IOPRIO_HINT_MASK             (IOPRIO_NR_HINTS - 1)
> >  #define IOPRIO_PRIO_HINT(ioprio)     \
> >       (((ioprio) >> IOPRIO_HINT_SHIFT) & IOPRIO_HINT_MASK)
> >
> > +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> > +#define IOPRIO_ACTIVITY_SHIFT                (IOPRIO_HINT_NR_BITS + IOPRIO_LEVEL_NR_BITS)
> > +#define IOPRIO_ACTIVITY_NR_BITS              7
>
> I already told you that taking all the free hint bits for yourself, leaving no
> room fo future IO hints, is not nice. Do you really need 7 bits for your thing ?
> Why does the activity even need to be part of the IO priority ? From the rather
> short explanation in the commit message, it seems that activity should simply
> raise the priority (either class or level or both). I do not see why that
> activity number needs to be in the ioprio. Who in the kernel will look at it ?
> IO scheduler ? the storage device ?
As I explained above, 7 bits(128 of 256) within ioprio is the minimum
number for counting active pages carried by this bio and will end at
the IO scheduler. bio has to be enlarged a new member to log these if
we don't use ioprio.
>
> > +#define IOPRIO_NR_ACTIVITY           (1 << IOPRIO_ACTIVITY_NR_BITS)
> > +#define IOPRIO_ACTIVITY_MASK         (IOPRIO_NR_ACTIVITY - 1)
> > +#define IOPRIO_PRIO_ACTIVITY(ioprio) \
> > +     (((ioprio) >> IOPRIO_ACTIVITY_SHIFT) & IOPRIO_ACTIVITY_MASK)
> > +#endif
> >  /*
> >   * I/O hints.
> >   */
> > @@ -104,6 +116,7 @@ enum {
> >
> >  #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))
> >
> > +#ifndef CONFIG_CONTENT_ACT_BASED_IOPRIO
> >  /*
> >   * Return an I/O priority value based on a class, a level and a hint.
> >   */
> > @@ -123,5 +136,30 @@ static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
> >       ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE)
> >  #define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint)       \
> >       ioprio_value(prioclass, priolevel, priohint)
> > +#else
> > +/*
> > + * Return an I/O priority value based on a class, a level, a hint and
> > + * content's activities
> > + */
> > +static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
> > +             int priohint, int activity)
> > +{
> > +     if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
> > +                     IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
> > +                     IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS) ||
> > +                     IOPRIO_BAD_VALUE(activity, IOPRIO_NR_ACTIVITY))
> > +             return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
> >
> > +     return (prioclass << IOPRIO_CLASS_SHIFT) |
> > +             (activity << IOPRIO_ACTIVITY_SHIFT) |
> > +             (priohint << IOPRIO_HINT_SHIFT) | priolevel;
> > +}
> > +
> > +#define IOPRIO_PRIO_VALUE(prioclass, priolevel)                      \
> > +     ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE, 0)
> > +#define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint)       \
> > +     ioprio_value(prioclass, priolevel, priohint, 0)
> > +#define IOPRIO_PRIO_VALUE_ACTIVITY(prioclass, priolevel, priohint, activity) \
> > +     ioprio_value(prioclass, priolevel, priohint, activity)
> > +#endif
> >  #endif /* _UAPI_LINUX_IOPRIO_H */
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 264a2df5ecf5..e0e5a5a44ded 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1240,6 +1240,14 @@ config LRU_GEN_STATS
> >         from evicted generations for debugging purpose.
> >
> >         This option has a per-memcg and per-node memory overhead.
> > +
> > +config CONTENT_ACT_BASED_IOPRIO
> > +     bool "Enable content activity based ioprio"
> > +     depends on LRU_GEN
> > +     default n
> > +     help
> > +       This item enable the feature of adjust bio's priority by
> > +       calculating its content's activity.
> >  # }
> >
> >  config ARCH_SUPPORTS_PER_VMA_LOCK
>
> --
> Damien Le Moal
> Western Digital Research
>
Damien Le Moal Jan. 30, 2024, 1:17 p.m. UTC | #3
On 1/30/24 21:43, Zhaoyang Huang wrote:
> On Tue, Jan 30, 2024 at 5:17 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>>
>> On 1/30/24 17:42, zhaoyang.huang wrote:
>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>>
>>> Currently, request's ioprio are set via task's schedule priority(when no
>>> blkcg configured), which has high priority tasks possess the privilege on
>>> both of CPU and IO scheduling.
>>> This commit works as a hint of original policy by promoting the request ioprio
>>> based on the page/folio's activity. The original idea comes from LRU_GEN
>>> which provides more precised folio activity than before. This commit try
>>> to adjust the request's ioprio when certain part of its folios are hot,
>>> which indicate that this request carry important contents and need be
>>> scheduled ealier.
>>>
>>> This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
>>> by changing the bio_add_page/folio API in erofs, ext4 and f2fs in
>>> another commit.
>>>
>>> Case 1:
>>> script[a] which get significant improved fault time as expected[b]
>>> where dd's cost also shrink from 55s to 40s.
>>> (1). fault_latency.bin is an ebpf based test tool which measure all task's
>>>    iowait latency during page fault when scheduled out/in.
>>> (2). costmem generate page fault by mmaping a file and access the VA.
>>> (3). dd generate concurrent vfs io.
>>>
>>> [a]
>>> ./fault_latency.bin 1 5 > /data/dd_costmem &
>>> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
>>> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
>>> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
>>> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
>>> dd if=/dev/block/sda of=/data/ddtest bs=1024 count=2048000 &
>>> dd if=/dev/block/sda of=/data/ddtest1 bs=1024 count=2048000 &
>>> dd if=/dev/block/sda of=/data/ddtest2 bs=1024 count=2048000 &
>>> dd if=/dev/block/sda of=/data/ddtest3 bs=1024 count=2048000
>>> [b]
>>>                        mainline               commit
>>> io wait                836us            156us
>>>
>>> Case 2:
>>> fio -filename=/dev/block/by-name/userdata -rw=randread -direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest
>>> mainline: 513MiB/s
>>> READ: bw=531MiB/s (557MB/s), 531MiB/s-531MiB/s (557MB/s-557MB/s), io=15.6GiB (16.8GB), run=30137-30137msec
>>> READ: bw=543MiB/s (569MB/s), 543MiB/s-543MiB/s (569MB/s-569MB/s), io=15.6GiB (16.8GB), run=29469-29469msec
>>> READ: bw=474MiB/s (497MB/s), 474MiB/s-474MiB/s (497MB/s-497MB/s), io=15.6GiB (16.8GB), run=33724-33724msec
>>> READ: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=15.6GiB (16.8GB), run=29928-29928msec
>>> READ: bw=523MiB/s (548MB/s), 523MiB/s-523MiB/s (548MB/s-548MB/s), io=15.6GiB (16.8GB), run=30617-30617msec
>>> READ: bw=492MiB/s (516MB/s), 492MiB/s-492MiB/s (516MB/s-516MB/s), io=15.6GiB (16.8GB), run=32518-32518msec
>>> READ: bw=533MiB/s (559MB/s), 533MiB/s-533MiB/s (559MB/s-559MB/s), io=15.6GiB (16.8GB), run=29993-29993msec
>>> READ: bw=524MiB/s (550MB/s), 524MiB/s-524MiB/s (550MB/s-550MB/s), io=15.6GiB (16.8GB), run=30526-30526msec
>>> READ: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=15.6GiB (16.8GB), run=30269-30269msec
>>> READ: bw=449MiB/s (471MB/s), 449MiB/s-449MiB/s (471MB/s-471MB/s), io=15.6GiB (16.8GB), run=35629-35629msec
>>>
>>> commit: 633MiB/s
>>> READ: bw=668MiB/s (700MB/s), 668MiB/s-668MiB/s (700MB/s-700MB/s), io=15.6GiB (16.8GB), run=23952-23952msec
>>> READ: bw=589MiB/s (618MB/s), 589MiB/s-589MiB/s (618MB/s-618MB/s), io=15.6GiB (16.8GB), run=27164-27164msec
>>> READ: bw=638MiB/s (669MB/s), 638MiB/s-638MiB/s (669MB/s-669MB/s), io=15.6GiB (16.8GB), run=25071-25071msec
>>> READ: bw=714MiB/s (749MB/s), 714MiB/s-714MiB/s (749MB/s-749MB/s), io=15.6GiB (16.8GB), run=22409-22409msec
>>> READ: bw=600MiB/s (629MB/s), 600MiB/s-600MiB/s (629MB/s-629MB/s), io=15.6GiB (16.8GB), run=26669-26669msec
>>> READ: bw=592MiB/s (621MB/s), 592MiB/s-592MiB/s (621MB/s-621MB/s), io=15.6GiB (16.8GB), run=27036-27036msec
>>> READ: bw=691MiB/s (725MB/s), 691MiB/s-691MiB/s (725MB/s-725MB/s), io=15.6GiB (16.8GB), run=23150-23150msec
>>> READ: bw=569MiB/s (596MB/s), 569MiB/s-569MiB/s (596MB/s-596MB/s), io=15.6GiB (16.8GB), run=28142-28142msec
>>> READ: bw=563MiB/s (590MB/s), 563MiB/s-563MiB/s (590MB/s-590MB/s), io=15.6GiB (16.8GB), run=28429-28429msec
>>> READ: bw=712MiB/s (746MB/s), 712MiB/s-712MiB/s (746MB/s-746MB/s), io=15.6GiB (16.8GB), run=22478-22478msec
>>>
>>> Case 3:
>>> This commit is also verified by the case of launching camera APP which is
>>> usually considered as heavy working load on both of memory and IO, which
>>> shows 12%-24% improvement.
>>>
>>>               ttl = 0         ttl = 50        ttl = 100
>>> mainline        2267ms                2420ms          2316ms
>>> commit          1992ms          1806ms          1998ms
>>>
>>> case 4:
>>> androbench has no improvment as well as regression which supposed to be
>>> its test time is short which MGLRU hasn't take effect yet.
>>>
>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>> ---
>>> change of v2: calculate page's activity via helper function
>>> change of v3: solve layer violation by move API into mm
>>> change of v4: keep block clean by removing the page related API
>>> change of v5: introduce the macros of bio_add_folio/page for read dir.
>>> ---
>>> ---
>>>  include/linux/act_ioprio.h  | 60 +++++++++++++++++++++++++++++++++++++
>>>  include/uapi/linux/ioprio.h | 38 +++++++++++++++++++++++
>>>  mm/Kconfig                  |  8 +++++
>>>  3 files changed, 106 insertions(+)
>>>  create mode 100644 include/linux/act_ioprio.h
>>>
>>> diff --git a/include/linux/act_ioprio.h b/include/linux/act_ioprio.h
>>> new file mode 100644
>>> index 000000000000..ca7309b85758
>>> --- /dev/null
>>> +++ b/include/linux/act_ioprio.h
>>> @@ -0,0 +1,60 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +#ifndef _ACT_IOPRIO_H
>>> +#define _ACT_IOPRIO_H
>>> +
>>> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
>>> +#include <linux/bio.h>
>>> +
>>> +static __maybe_unused
>>> +bool act_bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
>>> +             size_t off)
>>> +{
>>> +     int class, level, hint, activity;
>>> +     bool ret;
>>> +
>>> +     ret = bio_add_folio(bio, folio, len, off);
>>> +     if (bio_op(bio) == REQ_OP_READ && ret) {
>>> +             class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
>>> +             level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
>>> +             hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
>>> +             activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
>>> +             activity += (activity < IOPRIO_NR_ACTIVITY &&
>>> +                             folio_test_workingset(folio)) ? 1 : 0;
>>> +             if (activity >= bio->bi_vcnt / 2)
>>> +                     class = IOPRIO_CLASS_RT;
>>> +             else if (activity >= bio->bi_vcnt / 4)
>>> +                     class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
>>> +             activity = min(IOPRIO_NR_ACTIVITY - 1, activity);
>>> +             bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
>>> +     }
>>> +     return ret;
>>> +}
>>
>> Big non-inline functions in a header file... That is unusual, to say the least.
>> So every FS that includes this will get its own copy of the binary for these
>> functions. That is not exactly optimal.
> Thanks for quick reply:D
> This is a trade-off method for having both the block layer and fs be
> clean and do no modification. There is less calling bio_add_xxx within
> fs actually.
>>
>>> +
>>> +static __maybe_unused
>>> +int act_bio_add_page(struct bio *bio, struct page *page,
>>> +             unsigned int len, unsigned int offset)
>>> +{
>>> +     int class, level, hint, activity;
>>> +     int ret = 0;
>>> +
>>> +     ret = bio_add_page(bio, page, len, offset);
>>> +     if (bio_op(bio) == REQ_OP_READ && ret > 0) {
>>> +             class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
>>> +             level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
>>> +             hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
>>> +             activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
>>> +             activity += (activity < IOPRIO_NR_ACTIVITY &&
>>> +                             PageWorkingset(page)) ? 1 : 0;
>>> +             if (activity >= bio->bi_vcnt / 2)
>>> +                     class = IOPRIO_CLASS_RT;
>>> +             else if (activity >= bio->bi_vcnt / 4)
>>> +                     class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
>>> +             activity = min(IOPRIO_NR_ACTIVITY - 1, activity);
>>> +             bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
>>> +     }
>>> +     return ret;
>>> +}
>>> +#define bio_add_folio(bio, folio, len, off)     act_bio_add_folio(bio, folio, len, off)
>>> +#define bio_add_page(bio, page, len, offset)    act_bio_add_page(bio, page, len, offset)
>>
>> These functions are *NOT* part of the block layer. So please do not pretend they
>> are. Why don't you simply write a function equivalent to what you have inside
>> the "if" above and have the FS call that after bio_add_Page() ?
> The iteration of bio is costly(could be maximum to 256 pages) and
> needs fs's code modification. I will implement a version as you
> suggested.
>>
>> And I seriously doubt that all compilers will be happy with these macro names
>> clashing with real function names...
>>
>>> +#endif
>>> +#endif
>>> diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
>>> index bee2bdb0eedb..64cf5ff0ac5f 100644
>>> --- a/include/uapi/linux/ioprio.h
>>> +++ b/include/uapi/linux/ioprio.h
>>> @@ -71,12 +71,24 @@ enum {
>>>   * class and level.
>>>   */
>>>  #define IOPRIO_HINT_SHIFT            IOPRIO_LEVEL_NR_BITS
>>> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
>>> +#define IOPRIO_HINT_NR_BITS          3
>>> +#else
>>>  #define IOPRIO_HINT_NR_BITS          10
>>> +#endif
>>>  #define IOPRIO_NR_HINTS                      (1 << IOPRIO_HINT_NR_BITS)
>>>  #define IOPRIO_HINT_MASK             (IOPRIO_NR_HINTS - 1)
>>>  #define IOPRIO_PRIO_HINT(ioprio)     \
>>>       (((ioprio) >> IOPRIO_HINT_SHIFT) & IOPRIO_HINT_MASK)
>>>
>>> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
>>> +#define IOPRIO_ACTIVITY_SHIFT                (IOPRIO_HINT_NR_BITS + IOPRIO_LEVEL_NR_BITS)
>>> +#define IOPRIO_ACTIVITY_NR_BITS              7
>>
>> I already told you that taking all the free hint bits for yourself, leaving no
>> room fo future IO hints, is not nice. Do you really need 7 bits for your thing ?
>> Why does the activity even need to be part of the IO priority ? From the rather
>> short explanation in the commit message, it seems that activity should simply
>> raise the priority (either class or level or both). I do not see why that
>> activity number needs to be in the ioprio. Who in the kernel will look at it ?
>> IO scheduler ? the storage device ?
> As I explained above, 7 bits(128 of 256) within ioprio is the minimum
> number for counting active pages carried by this bio and will end at
> the IO scheduler. bio has to be enlarged a new member to log these if
> we don't use ioprio.

That information does not belong to the ioprio. And which scheduler acts on a
number of pages anyway ? The scheduler sees requests and BIOs. It can determine
the number of pages they have if that is an information it needs to make
scheduling decisison. Using ioprio to pass that information down is a dirty hack.

>>
>>> +#define IOPRIO_NR_ACTIVITY           (1 << IOPRIO_ACTIVITY_NR_BITS)
>>> +#define IOPRIO_ACTIVITY_MASK         (IOPRIO_NR_ACTIVITY - 1)
>>> +#define IOPRIO_PRIO_ACTIVITY(ioprio) \
>>> +     (((ioprio) >> IOPRIO_ACTIVITY_SHIFT) & IOPRIO_ACTIVITY_MASK)
>>> +#endif
>>>  /*
>>>   * I/O hints.
>>>   */
>>> @@ -104,6 +116,7 @@ enum {
>>>
>>>  #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))
>>>
>>> +#ifndef CONFIG_CONTENT_ACT_BASED_IOPRIO
>>>  /*
>>>   * Return an I/O priority value based on a class, a level and a hint.
>>>   */
>>> @@ -123,5 +136,30 @@ static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
>>>       ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE)
>>>  #define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint)       \
>>>       ioprio_value(prioclass, priolevel, priohint)
>>> +#else
>>> +/*
>>> + * Return an I/O priority value based on a class, a level, a hint and
>>> + * content's activities
>>> + */
>>> +static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
>>> +             int priohint, int activity)
>>> +{
>>> +     if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
>>> +                     IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
>>> +                     IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS) ||
>>> +                     IOPRIO_BAD_VALUE(activity, IOPRIO_NR_ACTIVITY))
>>> +             return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
>>>
>>> +     return (prioclass << IOPRIO_CLASS_SHIFT) |
>>> +             (activity << IOPRIO_ACTIVITY_SHIFT) |
>>> +             (priohint << IOPRIO_HINT_SHIFT) | priolevel;
>>> +}
>>> +
>>> +#define IOPRIO_PRIO_VALUE(prioclass, priolevel)                      \
>>> +     ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE, 0)
>>> +#define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint)       \
>>> +     ioprio_value(prioclass, priolevel, priohint, 0)
>>> +#define IOPRIO_PRIO_VALUE_ACTIVITY(prioclass, priolevel, priohint, activity) \
>>> +     ioprio_value(prioclass, priolevel, priohint, activity)
>>> +#endif
>>>  #endif /* _UAPI_LINUX_IOPRIO_H */
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 264a2df5ecf5..e0e5a5a44ded 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -1240,6 +1240,14 @@ config LRU_GEN_STATS
>>>         from evicted generations for debugging purpose.
>>>
>>>         This option has a per-memcg and per-node memory overhead.
>>> +
>>> +config CONTENT_ACT_BASED_IOPRIO
>>> +     bool "Enable content activity based ioprio"
>>> +     depends on LRU_GEN
>>> +     default n
>>> +     help
>>> +       This item enable the feature of adjust bio's priority by
>>> +       calculating its content's activity.
>>>  # }
>>>
>>>  config ARCH_SUPPORTS_PER_VMA_LOCK
>>
>> --
>> Damien Le Moal
>> Western Digital Research
>>
Zhaoyang Huang Jan. 30, 2024, 1:28 p.m. UTC | #4
On Tue, Jan 30, 2024 at 9:17 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 1/30/24 21:43, Zhaoyang Huang wrote:
> > On Tue, Jan 30, 2024 at 5:17 PM Damien Le Moal <dlemoal@kernel.org> wrote:
> >>
> >> On 1/30/24 17:42, zhaoyang.huang wrote:
> >>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >>>
> >>> Currently, request's ioprio are set via task's schedule priority(when no
> >>> blkcg configured), which has high priority tasks possess the privilege on
> >>> both of CPU and IO scheduling.
> >>> This commit works as a hint of original policy by promoting the request ioprio
> >>> based on the page/folio's activity. The original idea comes from LRU_GEN
> >>> which provides more precised folio activity than before. This commit try
> >>> to adjust the request's ioprio when certain part of its folios are hot,
> >>> which indicate that this request carry important contents and need be
> >>> scheduled ealier.
> >>>
> >>> This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
> >>> by changing the bio_add_page/folio API in erofs, ext4 and f2fs in
> >>> another commit.
> >>>
> >>> Case 1:
> >>> script[a] which get significant improved fault time as expected[b]
> >>> where dd's cost also shrink from 55s to 40s.
> >>> (1). fault_latency.bin is an ebpf based test tool which measure all task's
> >>>    iowait latency during page fault when scheduled out/in.
> >>> (2). costmem generate page fault by mmaping a file and access the VA.
> >>> (3). dd generate concurrent vfs io.
> >>>
> >>> [a]
> >>> ./fault_latency.bin 1 5 > /data/dd_costmem &
> >>> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> >>> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> >>> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> >>> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> >>> dd if=/dev/block/sda of=/data/ddtest bs=1024 count=2048000 &
> >>> dd if=/dev/block/sda of=/data/ddtest1 bs=1024 count=2048000 &
> >>> dd if=/dev/block/sda of=/data/ddtest2 bs=1024 count=2048000 &
> >>> dd if=/dev/block/sda of=/data/ddtest3 bs=1024 count=2048000
> >>> [b]
> >>>                        mainline               commit
> >>> io wait                836us            156us
> >>>
> >>> Case 2:
> >>> fio -filename=/dev/block/by-name/userdata -rw=randread -direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest
> >>> mainline: 513MiB/s
> >>> READ: bw=531MiB/s (557MB/s), 531MiB/s-531MiB/s (557MB/s-557MB/s), io=15.6GiB (16.8GB), run=30137-30137msec
> >>> READ: bw=543MiB/s (569MB/s), 543MiB/s-543MiB/s (569MB/s-569MB/s), io=15.6GiB (16.8GB), run=29469-29469msec
> >>> READ: bw=474MiB/s (497MB/s), 474MiB/s-474MiB/s (497MB/s-497MB/s), io=15.6GiB (16.8GB), run=33724-33724msec
> >>> READ: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=15.6GiB (16.8GB), run=29928-29928msec
> >>> READ: bw=523MiB/s (548MB/s), 523MiB/s-523MiB/s (548MB/s-548MB/s), io=15.6GiB (16.8GB), run=30617-30617msec
> >>> READ: bw=492MiB/s (516MB/s), 492MiB/s-492MiB/s (516MB/s-516MB/s), io=15.6GiB (16.8GB), run=32518-32518msec
> >>> READ: bw=533MiB/s (559MB/s), 533MiB/s-533MiB/s (559MB/s-559MB/s), io=15.6GiB (16.8GB), run=29993-29993msec
> >>> READ: bw=524MiB/s (550MB/s), 524MiB/s-524MiB/s (550MB/s-550MB/s), io=15.6GiB (16.8GB), run=30526-30526msec
> >>> READ: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=15.6GiB (16.8GB), run=30269-30269msec
> >>> READ: bw=449MiB/s (471MB/s), 449MiB/s-449MiB/s (471MB/s-471MB/s), io=15.6GiB (16.8GB), run=35629-35629msec
> >>>
> >>> commit: 633MiB/s
> >>> READ: bw=668MiB/s (700MB/s), 668MiB/s-668MiB/s (700MB/s-700MB/s), io=15.6GiB (16.8GB), run=23952-23952msec
> >>> READ: bw=589MiB/s (618MB/s), 589MiB/s-589MiB/s (618MB/s-618MB/s), io=15.6GiB (16.8GB), run=27164-27164msec
> >>> READ: bw=638MiB/s (669MB/s), 638MiB/s-638MiB/s (669MB/s-669MB/s), io=15.6GiB (16.8GB), run=25071-25071msec
> >>> READ: bw=714MiB/s (749MB/s), 714MiB/s-714MiB/s (749MB/s-749MB/s), io=15.6GiB (16.8GB), run=22409-22409msec
> >>> READ: bw=600MiB/s (629MB/s), 600MiB/s-600MiB/s (629MB/s-629MB/s), io=15.6GiB (16.8GB), run=26669-26669msec
> >>> READ: bw=592MiB/s (621MB/s), 592MiB/s-592MiB/s (621MB/s-621MB/s), io=15.6GiB (16.8GB), run=27036-27036msec
> >>> READ: bw=691MiB/s (725MB/s), 691MiB/s-691MiB/s (725MB/s-725MB/s), io=15.6GiB (16.8GB), run=23150-23150msec
> >>> READ: bw=569MiB/s (596MB/s), 569MiB/s-569MiB/s (596MB/s-596MB/s), io=15.6GiB (16.8GB), run=28142-28142msec
> >>> READ: bw=563MiB/s (590MB/s), 563MiB/s-563MiB/s (590MB/s-590MB/s), io=15.6GiB (16.8GB), run=28429-28429msec
> >>> READ: bw=712MiB/s (746MB/s), 712MiB/s-712MiB/s (746MB/s-746MB/s), io=15.6GiB (16.8GB), run=22478-22478msec
> >>>
> >>> Case 3:
> >>> This commit is also verified by the case of launching camera APP which is
> >>> usually considered as heavy working load on both of memory and IO, which
> >>> shows 12%-24% improvement.
> >>>
> >>>               ttl = 0         ttl = 50        ttl = 100
> >>> mainline        2267ms                2420ms          2316ms
> >>> commit          1992ms          1806ms          1998ms
> >>>
> >>> case 4:
> >>> androbench has no improvment as well as regression which supposed to be
> >>> its test time is short which MGLRU hasn't take effect yet.
> >>>
> >>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >>> ---
> >>> change of v2: calculate page's activity via helper function
> >>> change of v3: solve layer violation by move API into mm
> >>> change of v4: keep block clean by removing the page related API
> >>> change of v5: introduce the macros of bio_add_folio/page for read dir.
> >>> ---
> >>> ---
> >>>  include/linux/act_ioprio.h  | 60 +++++++++++++++++++++++++++++++++++++
> >>>  include/uapi/linux/ioprio.h | 38 +++++++++++++++++++++++
> >>>  mm/Kconfig                  |  8 +++++
> >>>  3 files changed, 106 insertions(+)
> >>>  create mode 100644 include/linux/act_ioprio.h
> >>>
> >>> diff --git a/include/linux/act_ioprio.h b/include/linux/act_ioprio.h
> >>> new file mode 100644
> >>> index 000000000000..ca7309b85758
> >>> --- /dev/null
> >>> +++ b/include/linux/act_ioprio.h
> >>> @@ -0,0 +1,60 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> >>> +#ifndef _ACT_IOPRIO_H
> >>> +#define _ACT_IOPRIO_H
> >>> +
> >>> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> >>> +#include <linux/bio.h>
> >>> +
> >>> +static __maybe_unused
> >>> +bool act_bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
> >>> +             size_t off)
> >>> +{
> >>> +     int class, level, hint, activity;
> >>> +     bool ret;
> >>> +
> >>> +     ret = bio_add_folio(bio, folio, len, off);
> >>> +     if (bio_op(bio) == REQ_OP_READ && ret) {
> >>> +             class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> >>> +             level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> >>> +             hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> >>> +             activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> >>> +             activity += (activity < IOPRIO_NR_ACTIVITY &&
> >>> +                             folio_test_workingset(folio)) ? 1 : 0;
> >>> +             if (activity >= bio->bi_vcnt / 2)
> >>> +                     class = IOPRIO_CLASS_RT;
> >>> +             else if (activity >= bio->bi_vcnt / 4)
> >>> +                     class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> >>> +             activity = min(IOPRIO_NR_ACTIVITY - 1, activity);
> >>> +             bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> >>> +     }
> >>> +     return ret;
> >>> +}
> >>
> >> Big non-inline functions in a header file... That is unusual, to say the least.
> >> So every FS that includes this will get its own copy of the binary for these
> >> functions. That is not exactly optimal.
> > Thanks for quick reply:D
> > This is a trade-off method for having both the block layer and fs be
> > clean and do no modification. There is less calling bio_add_xxx within
> > fs actually.
> >>
> >>> +
> >>> +static __maybe_unused
> >>> +int act_bio_add_page(struct bio *bio, struct page *page,
> >>> +             unsigned int len, unsigned int offset)
> >>> +{
> >>> +     int class, level, hint, activity;
> >>> +     int ret = 0;
> >>> +
> >>> +     ret = bio_add_page(bio, page, len, offset);
> >>> +     if (bio_op(bio) == REQ_OP_READ && ret > 0) {
> >>> +             class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> >>> +             level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> >>> +             hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> >>> +             activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> >>> +             activity += (activity < IOPRIO_NR_ACTIVITY &&
> >>> +                             PageWorkingset(page)) ? 1 : 0;
> >>> +             if (activity >= bio->bi_vcnt / 2)
> >>> +                     class = IOPRIO_CLASS_RT;
> >>> +             else if (activity >= bio->bi_vcnt / 4)
> >>> +                     class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> >>> +             activity = min(IOPRIO_NR_ACTIVITY - 1, activity);
> >>> +             bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> >>> +     }
> >>> +     return ret;
> >>> +}
> >>> +#define bio_add_folio(bio, folio, len, off)     act_bio_add_folio(bio, folio, len, off)
> >>> +#define bio_add_page(bio, page, len, offset)    act_bio_add_page(bio, page, len, offset)
> >>
> >> These functions are *NOT* part of the block layer. So please do not pretend they
> >> are. Why don't you simply write a function equivalent to what you have inside
> >> the "if" above and have the FS call that after bio_add_Page() ?
> > The iteration of bio is costly(could be maximum to 256 pages) and
> > needs fs's code modification. I will implement a version as you
> > suggested.
> >>
> >> And I seriously doubt that all compilers will be happy with these macro names
> >> clashing with real function names...
> >>
> >>> +#endif
> >>> +#endif
> >>> diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
> >>> index bee2bdb0eedb..64cf5ff0ac5f 100644
> >>> --- a/include/uapi/linux/ioprio.h
> >>> +++ b/include/uapi/linux/ioprio.h
> >>> @@ -71,12 +71,24 @@ enum {
> >>>   * class and level.
> >>>   */
> >>>  #define IOPRIO_HINT_SHIFT            IOPRIO_LEVEL_NR_BITS
> >>> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> >>> +#define IOPRIO_HINT_NR_BITS          3
> >>> +#else
> >>>  #define IOPRIO_HINT_NR_BITS          10
> >>> +#endif
> >>>  #define IOPRIO_NR_HINTS                      (1 << IOPRIO_HINT_NR_BITS)
> >>>  #define IOPRIO_HINT_MASK             (IOPRIO_NR_HINTS - 1)
> >>>  #define IOPRIO_PRIO_HINT(ioprio)     \
> >>>       (((ioprio) >> IOPRIO_HINT_SHIFT) & IOPRIO_HINT_MASK)
> >>>
> >>> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> >>> +#define IOPRIO_ACTIVITY_SHIFT                (IOPRIO_HINT_NR_BITS + IOPRIO_LEVEL_NR_BITS)
> >>> +#define IOPRIO_ACTIVITY_NR_BITS              7
> >>
> >> I already told you that taking all the free hint bits for yourself, leaving no
> >> room fo future IO hints, is not nice. Do you really need 7 bits for your thing ?
> >> Why does the activity even need to be part of the IO priority ? From the rather
> >> short explanation in the commit message, it seems that activity should simply
> >> raise the priority (either class or level or both). I do not see why that
> >> activity number needs to be in the ioprio. Who in the kernel will look at it ?
> >> IO scheduler ? the storage device ?
> > As I explained above, 7 bits(128 of 256) within ioprio is the minimum
> > number for counting active pages carried by this bio and will end at
> > the IO scheduler. bio has to be enlarged a new member to log these if
> > we don't use ioprio.
>
> That information does not belong to the ioprio. And which scheduler acts on a
> number of pages anyway ? The scheduler sees requests and BIOs. It can determine
> the number of pages they have if that is an information it needs to make
> scheduling decisison. Using ioprio to pass that information down is a dirty hack.
No. IO scheduler acts on IOPRIO_CLASS which is transferred from the
page's activity by the current method. I will implement another
version of iterating pages before submit_bio and feed back to the list
>
> >>
> >>> +#define IOPRIO_NR_ACTIVITY           (1 << IOPRIO_ACTIVITY_NR_BITS)
> >>> +#define IOPRIO_ACTIVITY_MASK         (IOPRIO_NR_ACTIVITY - 1)
> >>> +#define IOPRIO_PRIO_ACTIVITY(ioprio) \
> >>> +     (((ioprio) >> IOPRIO_ACTIVITY_SHIFT) & IOPRIO_ACTIVITY_MASK)
> >>> +#endif
> >>>  /*
> >>>   * I/O hints.
> >>>   */
> >>> @@ -104,6 +116,7 @@ enum {
> >>>
> >>>  #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))
> >>>
> >>> +#ifndef CONFIG_CONTENT_ACT_BASED_IOPRIO
> >>>  /*
> >>>   * Return an I/O priority value based on a class, a level and a hint.
> >>>   */
> >>> @@ -123,5 +136,30 @@ static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
> >>>       ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE)
> >>>  #define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint)       \
> >>>       ioprio_value(prioclass, priolevel, priohint)
> >>> +#else
> >>> +/*
> >>> + * Return an I/O priority value based on a class, a level, a hint and
> >>> + * content's activities
> >>> + */
> >>> +static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
> >>> +             int priohint, int activity)
> >>> +{
> >>> +     if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
> >>> +                     IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
> >>> +                     IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS) ||
> >>> +                     IOPRIO_BAD_VALUE(activity, IOPRIO_NR_ACTIVITY))
> >>> +             return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
> >>>
> >>> +     return (prioclass << IOPRIO_CLASS_SHIFT) |
> >>> +             (activity << IOPRIO_ACTIVITY_SHIFT) |
> >>> +             (priohint << IOPRIO_HINT_SHIFT) | priolevel;
> >>> +}
> >>> +
> >>> +#define IOPRIO_PRIO_VALUE(prioclass, priolevel)                      \
> >>> +     ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE, 0)
> >>> +#define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint)       \
> >>> +     ioprio_value(prioclass, priolevel, priohint, 0)
> >>> +#define IOPRIO_PRIO_VALUE_ACTIVITY(prioclass, priolevel, priohint, activity) \
> >>> +     ioprio_value(prioclass, priolevel, priohint, activity)
> >>> +#endif
> >>>  #endif /* _UAPI_LINUX_IOPRIO_H */
> >>> diff --git a/mm/Kconfig b/mm/Kconfig
> >>> index 264a2df5ecf5..e0e5a5a44ded 100644
> >>> --- a/mm/Kconfig
> >>> +++ b/mm/Kconfig
> >>> @@ -1240,6 +1240,14 @@ config LRU_GEN_STATS
> >>>         from evicted generations for debugging purpose.
> >>>
> >>>         This option has a per-memcg and per-node memory overhead.
> >>> +
> >>> +config CONTENT_ACT_BASED_IOPRIO
> >>> +     bool "Enable content activity based ioprio"
> >>> +     depends on LRU_GEN
> >>> +     default n
> >>> +     help
> >>> +       This item enable the feature of adjust bio's priority by
> >>> +       calculating its content's activity.
> >>>  # }
> >>>
> >>>  config ARCH_SUPPORTS_PER_VMA_LOCK
> >>
> >> --
> >> Damien Le Moal
> >> Western Digital Research
> >>
>
> --
> Damien Le Moal
> Western Digital Research
>
Matthew Wilcox (Oracle) Jan. 30, 2024, 1:52 p.m. UTC | #5
On Tue, Jan 30, 2024 at 04:42:07PM +0800, zhaoyang.huang wrote:
> +static __maybe_unused
> +bool act_bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
> +		size_t off)

No, stop this.  What the filesystem needs to do is not
s/bio_add_folio/act_bio_add_folio/.  There needs to be an API to set the
bio prio; something like:

	bio_set_active_prio(bio, folio);

and then you can do whatever gunk is needed in bio_set_active_prio()
to end up calling bio_set_prio().
Damien Le Moal Jan. 30, 2024, 1:53 p.m. UTC | #6
On 1/30/24 22:28, Zhaoyang Huang wrote:
>> That information does not belong to the ioprio. And which scheduler acts on a
>> number of pages anyway ? The scheduler sees requests and BIOs. It can determine
>> the number of pages they have if that is an information it needs to make
>> scheduling decisison. Using ioprio to pass that information down is a dirty hack.
> No. IO scheduler acts on IOPRIO_CLASS which is transferred from the
> page's activity by the current method. I will implement another
> version of iterating pages before submit_bio and feed back to the list

Then why are you modifying the ioprio user API to add the 7 bits of activity ?
If the scheduler only needs the priority class, then only set that and do not
touch the user facing API.
diff mbox series

Patch

diff --git a/include/linux/act_ioprio.h b/include/linux/act_ioprio.h
new file mode 100644
index 000000000000..ca7309b85758
--- /dev/null
+++ b/include/linux/act_ioprio.h
@@ -0,0 +1,60 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _ACT_IOPRIO_H
+#define _ACT_IOPRIO_H
+
+#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
+#include <linux/bio.h>
+
+static __maybe_unused
+bool act_bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
+		size_t off)
+{
+	int class, level, hint, activity;
+	bool ret;
+
+	ret = bio_add_folio(bio, folio, len, off);
+	if (bio_op(bio) == REQ_OP_READ && ret) {
+		class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
+		level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
+		hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
+		activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
+		activity += (activity < IOPRIO_NR_ACTIVITY &&
+				folio_test_workingset(folio)) ? 1 : 0;
+		if (activity >= bio->bi_vcnt / 2)
+			class = IOPRIO_CLASS_RT;
+		else if (activity >= bio->bi_vcnt / 4)
+			class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
+		activity = min(IOPRIO_NR_ACTIVITY - 1, activity);
+		bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
+	}
+	return ret;
+}
+
+static __maybe_unused
+int act_bio_add_page(struct bio *bio, struct page *page,
+		unsigned int len, unsigned int offset)
+{
+	int class, level, hint, activity;
+	int ret = 0;
+
+	ret = bio_add_page(bio, page, len, offset);
+	if (bio_op(bio) == REQ_OP_READ && ret > 0) {
+		class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
+		level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
+		hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
+		activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
+		activity += (activity < IOPRIO_NR_ACTIVITY &&
+				PageWorkingset(page)) ? 1 : 0;
+		if (activity >= bio->bi_vcnt / 2)
+			class = IOPRIO_CLASS_RT;
+		else if (activity >= bio->bi_vcnt / 4)
+			class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
+		activity = min(IOPRIO_NR_ACTIVITY - 1, activity);
+		bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
+	}
+	return ret;
+}
+#define bio_add_folio(bio, folio, len, off)     act_bio_add_folio(bio, folio, len, off)
+#define bio_add_page(bio, page, len, offset)    act_bio_add_page(bio, page, len, offset)
+#endif
+#endif
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index bee2bdb0eedb..64cf5ff0ac5f 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -71,12 +71,24 @@  enum {
  * class and level.
  */
 #define IOPRIO_HINT_SHIFT		IOPRIO_LEVEL_NR_BITS
+#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
+#define IOPRIO_HINT_NR_BITS		3
+#else
 #define IOPRIO_HINT_NR_BITS		10
+#endif
 #define IOPRIO_NR_HINTS			(1 << IOPRIO_HINT_NR_BITS)
 #define IOPRIO_HINT_MASK		(IOPRIO_NR_HINTS - 1)
 #define IOPRIO_PRIO_HINT(ioprio)	\
 	(((ioprio) >> IOPRIO_HINT_SHIFT) & IOPRIO_HINT_MASK)
 
+#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
+#define IOPRIO_ACTIVITY_SHIFT		(IOPRIO_HINT_NR_BITS + IOPRIO_LEVEL_NR_BITS)
+#define IOPRIO_ACTIVITY_NR_BITS		7
+#define IOPRIO_NR_ACTIVITY		(1 << IOPRIO_ACTIVITY_NR_BITS)
+#define IOPRIO_ACTIVITY_MASK		(IOPRIO_NR_ACTIVITY - 1)
+#define IOPRIO_PRIO_ACTIVITY(ioprio)	\
+	(((ioprio) >> IOPRIO_ACTIVITY_SHIFT) & IOPRIO_ACTIVITY_MASK)
+#endif
 /*
  * I/O hints.
  */
@@ -104,6 +116,7 @@  enum {
 
 #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))
 
+#ifndef CONFIG_CONTENT_ACT_BASED_IOPRIO
 /*
  * Return an I/O priority value based on a class, a level and a hint.
  */
@@ -123,5 +136,30 @@  static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
 	ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE)
 #define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint)	\
 	ioprio_value(prioclass, priolevel, priohint)
+#else
+/*
+ * Return an I/O priority value based on a class, a level, a hint and
+ * content's activities
+ */
+static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
+		int priohint, int activity)
+{
+	if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
+			IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
+			IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS) ||
+			IOPRIO_BAD_VALUE(activity, IOPRIO_NR_ACTIVITY))
+		return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
 
+	return (prioclass << IOPRIO_CLASS_SHIFT) |
+		(activity << IOPRIO_ACTIVITY_SHIFT) |
+		(priohint << IOPRIO_HINT_SHIFT) | priolevel;
+}
+
+#define IOPRIO_PRIO_VALUE(prioclass, priolevel)			\
+	ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE, 0)
+#define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint)	\
+	ioprio_value(prioclass, priolevel, priohint, 0)
+#define IOPRIO_PRIO_VALUE_ACTIVITY(prioclass, priolevel, priohint, activity)	\
+	ioprio_value(prioclass, priolevel, priohint, activity)
+#endif
 #endif /* _UAPI_LINUX_IOPRIO_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5..e0e5a5a44ded 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1240,6 +1240,14 @@  config LRU_GEN_STATS
 	  from evicted generations for debugging purpose.
 
 	  This option has a per-memcg and per-node memory overhead.
+
+config CONTENT_ACT_BASED_IOPRIO
+	bool "Enable content activity based ioprio"
+	depends on LRU_GEN
+	default n
+	help
+	  This item enable the feature of adjust bio's priority by
+	  calculating its content's activity.
 # }
 
 config ARCH_SUPPORTS_PER_VMA_LOCK