[RFC] Introduce generalized data temperature estimation framework

Message ID 20250123202455.11338-1-slava@dubeyko.com (mailing list archive)
State New

Commit Message

Viacheslav Dubeyko Jan. 23, 2025, 8:24 p.m. UTC
[PROBLEM DECLARATION]
An efficient data placement policy is a Holy Grail for data
storage and file system engineers. Achieving this goal is
as important as it is hard. Multiple data storage and file
system technologies have been invented to manage the data
placement policy (for example, COW, ZNS, FDP, etc.).
But these technologies still require hints about the
nature of the data from the application side.

[DATA "TEMPERATURE" CONCEPT]
One widely used and intuitively clear way to describe the
nature of data is data "temperature" (cold, warm, hot
data). However, data "temperature" is as elusive a
definition as it is intuitive. Generally speaking,
thermodynamics defines temperature as a measure of the
average kinetic energy of vibrating atoms in a substance.
But there is no direct analogy between data "temperature"
and temperature in physics, because data has no kinetic
energy.

[WHAT IS GENERALIZED DATA "TEMPERATURE" ESTIMATION]
We usually assume that if some data is updated more
frequently, then it is hotter than other data.
But there are several problems here:
(1) How can we estimate data "hotness" in a
quantitative way? (2) We can state that data is "hot"
after some number of updates. This definition describes
the state of the data in the past. Will this data
continue to be "hot" in the future?
Generally speaking, the crucial problem is how to define
the nature of the data, or the data "temperature", in the
future, because this knowledge is the fundamental basis
for elaborating an efficient data placement policy.
The generalized data "temperature" estimation framework
suggests a way to define the future state of the data
and a basis for quantitative measurement of data
"temperature".

[ARCHITECTURE OF FRAMEWORK]
Usually, a file system has a page cache for every inode.
Memory pages first become dirty in the page cache, and the
dirty pages are eventually sent to the storage device.
Technically speaking, the number of dirty pages in a
particular page cache is a quantitative measurement of the
current "hotness" of a file. But the number of dirty pages
alone is not a stable basis for quantitative measurement of
data "temperature". Instead, the total number of logical
blocks in a file can be used as the unit of one degree of
data "temperature". As a result, if the whole file has been
updated several times, then the "temperature" of the file
has increased by several degrees. And if the file is under
continuous updates, then the file "temperature" keeps
growing.

We need to keep not only the current number of dirty pages,
but also the number of pages updated in the recent past,
in order to accumulate the total "temperature" of a file.
Generally speaking, the total number of pages updated in
the recent past defines the aggregated "temperature" of
the file, and the number of dirty pages defines the delta
of "temperature" growth for the current update operation.
This approach defines the mechanism of "temperature" growth.
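
The growth rule can be illustrated with a small user-space model.
This is only a sketch under stated assumptions: the names
temp_model, model_dirty_block() and model_flush_block() are
hypothetical and not part of the patch, and one degree is taken to
be the total number of logical blocks in the file.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct temp_model {
	uint64_t file_blocks;    /* total logical blocks in the file (one degree) */
	uint64_t dirty_blocks;   /* blocks dirtied by the current update operation */
	uint64_t updated_blocks; /* blocks written back in the recent past */
	uint64_t temperature;    /* whole degrees */
};

/* Account one newly dirtied block; the temperature can only grow here. */
void model_dirty_block(struct temp_model *m)
{
	uint64_t by_dirty, by_updated, candidate;

	assert(m);
	m->dirty_blocks++;
	if (m->file_blocks == 0)
		return; /* an empty file has no meaningful degree */

	by_dirty = m->dirty_blocks / m->file_blocks;
	by_updated = m->updated_blocks / m->file_blocks;
	candidate = by_dirty > by_updated ? by_dirty : by_updated;
	if (candidate > m->temperature)
		m->temperature = candidate;
}

/* Account one block sent to storage: it moves from "dirty" to "updated". */
void model_flush_block(struct temp_model *m)
{
	assert(m);
	if (m->dirty_blocks > 0) {
		m->dirty_blocks--;
		m->updated_blocks++;
	}
}
```

With a file of 4 blocks, dirtying 8 blocks in a row brings the model
to 2 degrees. Flushing them moves the count into updated_blocks, so
a later, smaller update does not lose the accumulated value.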

But if there are no more updates to the file, then its
"temperature" needs to decrease. The starting and ending
timestamps of an update operation can serve as the basis
for decreasing the "temperature" of a file. If we know the
number of updated logical blocks of the file, then we can
divide the duration of the update operation by the number
of updated logical blocks. This gives the time duration
per logical block. Multiplying this value by the total
number of logical blocks in the file gives the time it
takes the "temperature" to decrease by one degree.
Finally, dividing the time range (between the end of the
last update operation and the beginning of the new update
operation) by the time it takes the "temperature" to
decrease by one degree gives the number of degrees that
should be subtracted from the current "temperature" of
the file.
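
The cooling arithmetic can be written down as a short user-space
helper. This is a sketch, not the patch's code: struct cooling_input
and degrees_to_subtract() are hypothetical names, all times are
assumed to be in nanoseconds, and the overflow guard is an addition
made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inputs for the cooling calculation; names are illustrative. */
struct cooling_input {
	uint64_t update_ns;      /* duration of the last update operation */
	uint64_t updated_blocks; /* logical blocks written during that update */
	uint64_t file_blocks;    /* total logical blocks in the file */
	uint64_t idle_ns;        /* end of last update to start of the new one */
};

/* Return how many whole degrees to subtract after the idle period. */
uint64_t degrees_to_subtract(const struct cooling_input *in)
{
	uint64_t ns_per_block, ns_per_degree;

	assert(in);
	if (in->update_ns == 0 || in->updated_blocks == 0 || in->file_blocks == 0)
		return 0; /* not enough history to estimate a cooling rate */

	ns_per_block = in->update_ns / in->updated_blocks;
	if (ns_per_block == 0)
		ns_per_block = 1; /* avoid dividing by zero below */

	if (ns_per_block > UINT64_MAX / in->file_blocks)
		return 0; /* one degree would take longer than 64-bit ns can hold */

	ns_per_degree = ns_per_block * in->file_blocks;
	return in->idle_ns / ns_per_degree;
}
```

For example, an update that wrote 100 blocks in 100 ms gives 1 ms per
block. For a 1000-block file, one degree then takes 1 s to cool, so
5.5 s of idle time subtracts 5 degrees.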

[HOW TO USE THE APPROACH]
The lifetime of the data "temperature" value for a file
can be explained in steps: (1) the iget() method sets up
the data "temperature" object; (2) folio_account_dirtied()
accounts for the number of dirty memory pages and
tries to estimate the current temperature of the file;
(3) folio_clear_dirty_for_io() decreases the number of dirty
memory pages and increases the number of updated pages;
(4) folio_account_dirtied() also decreases the file's
"temperature" if no updates have happened for some time;
(5) the file system can get the file's temperature and
share the hint with the block layer; (6) the inode
eviction method removes and frees the data "temperature"
object.
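
Steps (1) to (4) can be pictured as a three-state machine. The
following user-space sketch is loosely modeled on the states used in
the patch (created, update started, update finished); the type and
function names are hypothetical, timestamps are passed in by the
caller, and steps (5) and (6) are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the three accounting states. */
enum model_state {
	MODEL_CREATED,
	MODEL_UPDATE_STARTED,
	MODEL_UPDATE_FINISHED,
};

struct lifecycle_model {
	enum model_state state;
	uint64_t dirty_blocks;
	uint64_t updated_blocks;
	uint64_t start_ns; /* start of the current update operation */
	uint64_t end_ns;   /* end of the last finished update operation */
};

/* Step (1): set up the object when the inode is instantiated. */
void model_init(struct lifecycle_model *m)
{
	assert(m);
	m->state = MODEL_CREATED;
	m->dirty_blocks = 0;
	m->updated_blocks = 0;
	m->start_ns = 0;
	m->end_ns = 0;
}

/* Steps (2) and (4): a page was dirtied at time now_ns. */
void model_on_dirty(struct lifecycle_model *m, uint64_t now_ns)
{
	assert(m);
	switch (m->state) {
	case MODEL_UPDATE_STARTED:
		m->dirty_blocks++;
		break;
	case MODEL_CREATED:
	case MODEL_UPDATE_FINISHED:
		/* Coming from UPDATE_FINISHED, cooling (step 4) would go here. */
		m->dirty_blocks = 1;
		m->start_ns = now_ns;
		m->state = MODEL_UPDATE_STARTED;
		break;
	}
}

/* Step (3): a dirty page was handed to writeback at time now_ns. */
void model_on_flush(struct lifecycle_model *m, uint64_t now_ns)
{
	assert(m);
	if (m->state != MODEL_UPDATE_STARTED || m->dirty_blocks == 0)
		return;

	m->dirty_blocks--;
	m->updated_blocks++;
	if (m->dirty_blocks == 0) {
		m->end_ns = now_ns; /* the update operation is complete */
		m->state = MODEL_UPDATE_FINISHED;
	}
}
```

An update operation in this model spans the interval from the first
dirtied page to the moment the last dirty page is flushed; that
interval is what the cooling calculation later divides by.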

Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
---
 fs/Kconfig                             |   2 +
 fs/Makefile                            |   1 +
 fs/data-temperature/Kconfig            |  11 +
 fs/data-temperature/Makefile           |   3 +
 fs/data-temperature/data_temperature.c | 347 +++++++++++++++++++++++++
 include/linux/data_temperature.h       | 124 +++++++++
 include/linux/fs.h                     |   4 +
 mm/page-writeback.c                    |   9 +
 8 files changed, 501 insertions(+)
 create mode 100644 fs/data-temperature/Kconfig
 create mode 100644 fs/data-temperature/Makefile
 create mode 100644 fs/data-temperature/data_temperature.c
 create mode 100644 include/linux/data_temperature.h

Comments

Johannes Thumshirn Jan. 24, 2025, 8:19 a.m. UTC | #1
On 23.01.25 21:30, Viacheslav Dubeyko wrote:
> [...]

I don't want to pour gasoline on old flame wars, but what is the 
advantage of this auto-magic data temperature framework vs the existing 
framework? 'enum rw_hint' has temperature in the range of none, short, 
medium, long and extreme (whatever that means), can be set by an 
application via an fcntl() and is plumbed down all the way to the bio 
level by most FSes that care.
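
For reference, the existing per-inode hint is set from user space
with the F_SET_RW_HINT fcntl, which takes a pointer to a 64-bit
value. A minimal sketch follows; the wrapper names are made up for
illustration, and the fallback defines are only there in case the
libc headers do not provide the constants.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Fallbacks for libc headers that lack these; values from the Linux UAPI. */
#ifndef F_GET_RW_HINT
#define F_GET_RW_HINT 1035
#endif
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT 1036
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT 2
#endif

/* Set the per-inode write lifetime hint; returns 0, or -1 with errno set. */
int set_write_life_hint(int fd, uint64_t hint)
{
	return fcntl(fd, F_SET_RW_HINT, &hint);
}

/* Read the hint back into *hint; returns 0, or -1 with errno set. */
int get_write_life_hint(int fd, uint64_t *hint)
{
	assert(hint);
	return fcntl(fd, F_GET_RW_HINT, hint);
}
```

An application that expects a file to be rewritten soon would call
set_write_life_hint(fd, RWH_WRITE_LIFE_SHORT) on its open descriptor.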
Viacheslav Dubeyko Jan. 24, 2025, 9:03 p.m. UTC | #2
On Fri, 2025-01-24 at 08:19 +0000, Johannes Thumshirn wrote:
> On 23.01.25 21:30, Viacheslav Dubeyko wrote:
> > [...]
> 
> I don't want to pour gasoline on old flame wars, but what is the 
> advantage of this auto-magic data temperature framework vs the existing 
> framework?
> 

There is no magic in this framework. :) It's a simple and compact framework.

>  'enum rw_hint' has temperature in the range of none, short, 
> medium, long and extreme (what ever that means), can be set by an 
> application via an fcntl() and is plumbed down all the way to the bio 
> level by most FSes that care.

I see your point. But the 'enum rw_hint' defines qualitative grades again:

enum rw_hint {
	WRITE_LIFE_NOT_SET	= RWH_WRITE_LIFE_NOT_SET,
	WRITE_LIFE_NONE		= RWH_WRITE_LIFE_NONE,
	WRITE_LIFE_SHORT	= RWH_WRITE_LIFE_SHORT,  <-- HOT data
	WRITE_LIFE_MEDIUM	= RWH_WRITE_LIFE_MEDIUM, <-- WARM data
	WRITE_LIFE_LONG		= RWH_WRITE_LIFE_LONG,   <-- COLD data
	WRITE_LIFE_EXTREME	= RWH_WRITE_LIFE_EXTREME,
} __packed;

First of all, again, it's hard to compare the hotness of different files
on such a qualitative basis. Secondly, who decides how hot a particular
piece of data is? People can only guess or assume the nature of data based
on past experience. But workloads change and evolve continuously and in
real time. Technically speaking, an application can try to estimate the
hotness of data, but, again, the file system can receive requests from
multiple threads and multiple applications. So an application can only
guess at the real nature of the data too. And nobody would want to
implement dedicated logic in an application for data hotness estimation.

This framework is inode based and it tries to estimate a file's
"temperature" on a quantitative basis. Advantages of this framework:
(1) we don't need to guess about data hotness, the temperature will be
calculated quantitatively; (2) the quantitative basis makes a fair
comparison of different files' temperatures possible; (3) a file's
temperature will change in real time as the workload(s) change; (4) a
file's temperature will be correctly accounted for under load from
multiple applications. I believe these are advantages of the suggested
framework.

Thanks,
Slava.
Jeff Layton Jan. 25, 2025, 12:25 p.m. UTC | #3
On Thu, 2025-01-23 at 12:24 -0800, Viacheslav Dubeyko wrote:
> [...]


This seems like an interesting idea, but how do you intend to use the
temperature?

With this patch, it looks like you're just calculating it, but there is
nothing that uses it and there is no way to access the temperature from
userland. It would be nice to see this value used by an existing
subsystem to drive data placement so we can see how it will help
things.
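
One possible consumer, sketched purely as an illustration and not
taken from the patch: bucket the integer temperature into the
lifetime grades of 'enum rw_hint' (short for hot, medium for warm,
long for cold). The enum, the struct and the threshold fields below
are all hypothetical; the patch defines no such policy.

```c
#include <assert.h>

/* Hypothetical hint buckets, named after the lifetime grades in enum rw_hint. */
enum model_hint {
	MODEL_HINT_NOT_SET,
	MODEL_HINT_SHORT,  /* hot data */
	MODEL_HINT_MEDIUM, /* warm data */
	MODEL_HINT_LONG,   /* cold data */
};

/* Hypothetical policy knobs; the patch defines no such thresholds. */
struct hint_policy {
	int warm_min; /* lowest temperature treated as warm */
	int hot_min;  /* lowest temperature treated as hot */
};

/* Map a temperature in degrees to a lifetime bucket. */
enum model_hint temperature_to_hint(int temperature,
				    const struct hint_policy *p)
{
	assert(p && p->warm_min <= p->hot_min);

	if (temperature < 0)
		return MODEL_HINT_NOT_SET; /* no valid estimate */
	if (temperature >= p->hot_min)
		return MODEL_HINT_SHORT;
	if (temperature >= p->warm_min)
		return MODEL_HINT_MEDIUM;
	return MODEL_HINT_LONG;
}
```

Where the thresholds should sit, and whether they should be fixed or
derived from the workload, is exactly the kind of question a real
user of the value would have to answer.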

> diff --git a/fs/Kconfig b/fs/Kconfig
> index 64d420e3c475..ae117c2e3ce2 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -139,6 +139,8 @@ source "fs/autofs/Kconfig"
>  source "fs/fuse/Kconfig"
>  source "fs/overlayfs/Kconfig"
>  
> +source "fs/data-temperature/Kconfig"
> +
>  menu "Caches"
>  
>  source "fs/netfs/Kconfig"
> diff --git a/fs/Makefile b/fs/Makefile
> index 15df0a923d3a..c7e6ccac633d 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -129,3 +129,4 @@ obj-$(CONFIG_EROFS_FS)		+= erofs/
>  obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
>  obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
>  obj-$(CONFIG_BPF_LSM)		+= bpf_fs_kfuncs.o
> +obj-$(CONFIG_DATA_TEMPERATURE)	+= data-temperature/
> diff --git a/fs/data-temperature/Kconfig b/fs/data-temperature/Kconfig
> new file mode 100644
> index 000000000000..1cade2741982
> --- /dev/null
> +++ b/fs/data-temperature/Kconfig
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +config DATA_TEMPERATURE
> +	bool "Data temperature approach for efficient data placement"
> +	help
> +	  Enable data "temperature" estimation for efficient data
> +	  placement policy. This approach is file based and
> +	  it estimates "temperature" for every file independently.
> +	  The goal of the approach is to provide valuable hints
> +	  to file system or/and SSD for isolation and proper
> +	  management of data with different temperatures.
> diff --git a/fs/data-temperature/Makefile b/fs/data-temperature/Makefile
> new file mode 100644
> index 000000000000..8e089a681360
> --- /dev/null
> +++ b/fs/data-temperature/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +obj-$(CONFIG_DATA_TEMPERATURE) += data_temperature.o
> diff --git a/fs/data-temperature/data_temperature.c b/fs/data-temperature/data_temperature.c
> new file mode 100644
> index 000000000000..ea43fbfc3976
> --- /dev/null
> +++ b/fs/data-temperature/data_temperature.c
> @@ -0,0 +1,347 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Data "temperature" paradigm implementation
> + *
> + * Copyright (c) 2024-2025 Viacheslav Dubeyko <slava@dubeyko.com>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/pagemap.h>
> +#include <linux/data_temperature.h>
> +#include <linux/fs.h>
> +
> +#define TIME_IS_UNKNOWN		(U64_MAX)
> +
> +struct kmem_cache *data_temperature_info_cachep;
> +
> +static inline
> +void create_data_temperature_info(struct data_temperature *dt_info)
> +{
> +	if (!dt_info)
> +		return;
> +
> +	atomic_set(&dt_info->temperature, 0);
> +	dt_info->updated_blocks = 0;
> +	dt_info->dirty_blocks = 0;
> +	dt_info->start_timestamp = TIME_IS_UNKNOWN;
> +	dt_info->end_timestamp = TIME_IS_UNKNOWN;
> +	dt_info->state = DATA_TEMPERATURE_CREATED;
> +}
> +
> +static inline
> +void free_data_temperature_info(struct data_temperature *dt_info)
> +{
> +	if (!dt_info)
> +		return;
> +
> +	kmem_cache_free(data_temperature_info_cachep, dt_info);
> +}
> +
> +int __set_data_temperature_info(struct inode *inode)
> +{
> +	struct data_temperature *dt_info;
> +
> +	dt_info = kmem_cache_zalloc(data_temperature_info_cachep, GFP_KERNEL);
> +	if (!dt_info)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&dt_info->change_lock);
> +	create_data_temperature_info(dt_info);
> +
> +	if (cmpxchg_release(&inode->i_data_temperature_info,
> +					NULL, dt_info) != NULL) {
> +		free_data_temperature_info(dt_info);
> +		get_data_temperature_info(inode);
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(__set_data_temperature_info);
> +
> +void __remove_data_temperature_info(struct inode *inode)
> +{
> +	free_data_temperature_info(inode->i_data_temperature_info);
> +	inode->i_data_temperature_info = NULL;
> +}
> +EXPORT_SYMBOL_GPL(__remove_data_temperature_info);
> +
> +int __get_data_temperature(const struct inode *inode)
> +{
> +	struct data_temperature *dt_info;
> +
> +	if (!S_ISREG(inode->i_mode))
> +		return 0;
> +
> +	dt_info = get_data_temperature_info(inode);
> +	if (IS_ERR_OR_NULL(dt_info))
> +		return 0;
> +
> +	return atomic_read(&dt_info->temperature);
> +}
> +EXPORT_SYMBOL_GPL(__get_data_temperature);
> +
> +static inline
> +bool is_timestamp_invalid(struct data_temperature *dt_info)
> +{
> +	if (!dt_info)
> +		return false;
> +
> +	if (dt_info->start_timestamp == TIME_IS_UNKNOWN ||
> +	    dt_info->end_timestamp == TIME_IS_UNKNOWN)
> +		return true;
> +
> +	if (dt_info->start_timestamp > dt_info->end_timestamp)
> +		return true;
> +
> +	return false;
> +}
> +
> +static inline
> +u64 get_current_timestamp(void)
> +{
> +	return ktime_get_boottime_ns();
> +}
> +
> +static inline
> +void start_account_data_temperature_info(struct data_temperature *dt_info)
> +{
> +	if (!dt_info)
> +		return;
> +
> +	dt_info->dirty_blocks = 1;
> +	dt_info->start_timestamp = get_current_timestamp();
> +	dt_info->end_timestamp = TIME_IS_UNKNOWN;
> +	dt_info->state = DATA_TEMPERATURE_UPDATE_STARTED;
> +}
> +
> +static inline
> +void __increase_data_temperature(struct inode *inode,
> +				 struct data_temperature *dt_info)
> +{
> +	u64 bytes_count;
> +	u64 file_blocks;
> +	u32 block_bytes;
> +	int dirty_blocks_ratio;
> +	int updated_blocks_ratio;
> +	int old_temperature;
> +	int calculated;
> +
> +	if (!inode || !dt_info)
> +		return;
> +
> +	block_bytes = 1 << inode->i_blkbits;
> +	bytes_count = i_size_read(inode) + block_bytes - 1;
> +	file_blocks = bytes_count >> inode->i_blkbits;
> +
> +	dt_info->dirty_blocks++;
> +
> +	if (file_blocks > 0) {
> +		old_temperature = atomic_read(&dt_info->temperature);
> +
> +		dirty_blocks_ratio = div64_u64(dt_info->dirty_blocks,
> +						file_blocks);
> +		updated_blocks_ratio = div64_u64(dt_info->updated_blocks,
> +						file_blocks);
> +		calculated = max_t(int, dirty_blocks_ratio,
> +					updated_blocks_ratio);
> +
> +		if (calculated > 0 && old_temperature < calculated)
> +			atomic_set(&dt_info->temperature, calculated);
> +	}
> +}
> +
> +static inline
> +void __decrease_data_temperature(struct inode *inode,
> +				 struct data_temperature *dt_info)
> +{
> +	u64 timestamp;
> +	u64 time_range;
> +	u64 time_diff;
> +	u64 bytes_count;
> +	u64 file_blocks;
> +	u32 block_bytes;
> +	u64 blks_per_temperature_degree;
> +	u64 ns_per_block;
> +	u64 temperature_diff;
> +
> +	if (!inode || !dt_info)
> +		return;
> +
> +	if (is_timestamp_invalid(dt_info)) {
> +		create_data_temperature_info(dt_info);
> +		return;
> +	}
> +
> +	timestamp = get_current_timestamp();
> +
> +	if (dt_info->end_timestamp > timestamp) {
> +		create_data_temperature_info(dt_info);
> +		return;
> +	}
> +
> +	time_range = dt_info->end_timestamp - dt_info->start_timestamp;
> +	time_diff = timestamp - dt_info->end_timestamp;
> +
> +	block_bytes = 1 << inode->i_blkbits;
> +	bytes_count = i_size_read(inode) + block_bytes - 1;
> +	file_blocks = bytes_count >> inode->i_blkbits;
> +
> +	blks_per_temperature_degree = file_blocks;
> +	if (blks_per_temperature_degree == 0) {
> +		start_account_data_temperature_info(dt_info);
> +		return;
> +	}
> +
> +	if (dt_info->updated_blocks == 0 || time_range == 0) {
> +		start_account_data_temperature_info(dt_info);
> +		return;
> +	}
> +
> +	ns_per_block = div64_u64(time_range, dt_info->updated_blocks);
> +	if (ns_per_block == 0)
> +		ns_per_block = 1;
> +
> +	if (time_diff == 0) {
> +		start_account_data_temperature_info(dt_info);
> +		return;
> +	}
> +
> +	temperature_diff = div64_u64(time_diff, ns_per_block);
> +	temperature_diff = div64_u64(temperature_diff,
> +					blks_per_temperature_degree);
> +
> +	if (temperature_diff == 0)
> +		return;
> +
> +	if (temperature_diff <= atomic_read(&dt_info->temperature)) {
> +		atomic_sub(temperature_diff, &dt_info->temperature);
> +		dt_info->updated_blocks -=
> +			temperature_diff * blks_per_temperature_degree;
> +	} else {
> +		atomic_set(&dt_info->temperature, 0);
> +		dt_info->updated_blocks = 0;
> +	}
> +}
> +
> +int __increase_data_temperature_by_dirty_folio(struct folio *folio)
> +{
> +	struct inode *inode;
> +	struct data_temperature *dt_info;
> +
> +	if (!folio || !folio->mapping)
> +		return 0;
> +
> +	inode = folio_inode(folio);
> +
> +	if (!S_ISREG(inode->i_mode))
> +		return 0;
> +
> +	dt_info = get_data_temperature_info(inode);
> +	if (IS_ERR_OR_NULL(dt_info))
> +		return 0;
> +
> +	spin_lock(&dt_info->change_lock);
> +	switch (dt_info->state) {
> +	case DATA_TEMPERATURE_CREATED:
> +		atomic_set(&dt_info->temperature, 0);
> +		start_account_data_temperature_info(dt_info);
> +		break;
> +
> +	case DATA_TEMPERATURE_UPDATE_STARTED:
> +		__increase_data_temperature(inode, dt_info);
> +		break;
> +
> +	case DATA_TEMPERATURE_UPDATE_FINISHED:
> +		__decrease_data_temperature(inode, dt_info);
> +		start_account_data_temperature_info(dt_info);
> +		break;
> +
> +	default:
> +		/* do nothing */
> +		break;
> +	}
> +	spin_unlock(&dt_info->change_lock);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(__increase_data_temperature_by_dirty_folio);
> +
> +static inline
> +void decrement_dirty_blocks(struct data_temperature *dt_info)
> +{
> +	if (!dt_info)
> +		return;
> +
> +	if (dt_info->dirty_blocks > 0) {
> +		dt_info->dirty_blocks--;
> +		dt_info->updated_blocks++;
> +	}
> +}
> +
> +static inline
> +void finish_increasing_data_temperature(struct data_temperature *dt_info)
> +{
> +	if (!dt_info)
> +		return;
> +
> +	if (dt_info->dirty_blocks == 0) {
> +		dt_info->end_timestamp = get_current_timestamp();
> +		dt_info->state = DATA_TEMPERATURE_UPDATE_FINISHED;
> +	}
> +}
> +
> +int __account_flushed_folio_by_data_temperature(struct folio *folio)
> +{
> +	struct inode *inode;
> +	struct data_temperature *dt_info;
> +
> +	if (!folio || !folio->mapping)
> +		return 0;
> +
> +	inode = folio_inode(folio);
> +
> +	if (!S_ISREG(inode->i_mode))
> +		return 0;
> +
> +	dt_info = get_data_temperature_info(inode);
> +	if (IS_ERR_OR_NULL(dt_info))
> +		return 0;
> +
> +	spin_lock(&dt_info->change_lock);
> +	switch (dt_info->state) {
> +	case DATA_TEMPERATURE_CREATED:
> +		create_data_temperature_info(dt_info);
> +		break;
> +
> +	case DATA_TEMPERATURE_UPDATE_STARTED:
> +		if (dt_info->dirty_blocks > 0)
> +			decrement_dirty_blocks(dt_info);
> +		if (dt_info->dirty_blocks == 0)
> +			finish_increasing_data_temperature(dt_info);
> +		break;
> +
> +	case DATA_TEMPERATURE_UPDATE_FINISHED:
> +		/* do nothing */
> +		break;
> +
> +	default:
> +		/* do nothing */
> +		break;
> +	}
> +	spin_unlock(&dt_info->change_lock);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(__account_flushed_folio_by_data_temperature);
> +
> +static int __init data_temperature_init(void)
> +{
> +	data_temperature_info_cachep = KMEM_CACHE(data_temperature,
> +						  SLAB_RECLAIM_ACCOUNT);
> +	if (!data_temperature_info_cachep)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +late_initcall(data_temperature_init);
> diff --git a/include/linux/data_temperature.h b/include/linux/data_temperature.h
> new file mode 100644
> index 000000000000..40abf6322385
> --- /dev/null
> +++ b/include/linux/data_temperature.h
> @@ -0,0 +1,124 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Data "temperature" paradigm declarations
> + *
> + * Copyright (c) 2024-2025 Viacheslav Dubeyko <slava@dubeyko.com>
> + */
> +
> +#ifndef _LINUX_DATA_TEMPERATURE_H
> +#define _LINUX_DATA_TEMPERATURE_H
> +
> +/*
> + * struct data_temperature - data temperature definition
> + * @temperature: current temperature of a file
> + * @change_lock: modification lock
> + * @state: current state of data temperature object
> + * @dirty_blocks: current number of dirty blocks in page cache
> + * @updated_blocks: number of updated blocks [start_timestamp, end_timestamp]
> + * @start_timestamp: starting timestamp of update operations
> + * @end_timestamp: finishing timestamp of update operations
> + */
> +struct data_temperature {
> +	atomic_t temperature;
> +
> +	spinlock_t change_lock;
> +	int state;
> +	u64 dirty_blocks;
> +	u64 updated_blocks;
> +	u64 start_timestamp;
> +	u64 end_timestamp;
> +};
> +
> +enum data_temperature_state {
> +	DATA_TEMPERATURE_UNKNOWN_STATE,
> +	DATA_TEMPERATURE_CREATED,
> +	DATA_TEMPERATURE_UPDATE_STARTED,
> +	DATA_TEMPERATURE_UPDATE_FINISHED,
> +	DATA_TEMPERATURE_STATE_MAX
> +};
> +
> +#ifdef CONFIG_DATA_TEMPERATURE
> +
> +int __set_data_temperature_info(struct inode *inode);
> +void __remove_data_temperature_info(struct inode *inode);
> +int __get_data_temperature(const struct inode *inode);
> +int __increase_data_temperature_by_dirty_folio(struct folio *folio);
> +int __account_flushed_folio_by_data_temperature(struct folio *folio);
> +
> +static inline
> +struct data_temperature *get_data_temperature_info(const struct inode *inode)
> +{
> +	return smp_load_acquire(&inode->i_data_temperature_info);
> +}
> +
> +static inline
> +int set_data_temperature_info(struct inode *inode)
> +{
> +	return __set_data_temperature_info(inode);
> +}
> +
> +static inline
> +void remove_data_temperature_info(struct inode *inode)
> +{
> +	__remove_data_temperature_info(inode);
> +}
> +
> +static inline
> +int get_data_temperature(const struct inode *inode)
> +{
> +	return __get_data_temperature(inode);
> +}
> +
> +static inline
> +int increase_data_temperature_by_dirty_folio(struct folio *folio)
> +{
> +	return __increase_data_temperature_by_dirty_folio(folio);
> +}
> +
> +static inline
> +int account_flushed_folio_by_data_temperature(struct folio *folio)
> +{
> +	return __account_flushed_folio_by_data_temperature(folio);
> +}
> +
> +#else  /* !CONFIG_DATA_TEMPERATURE */
> +
> +static inline
> +int set_data_temperature_info(struct inode *inode)
> +{
> +	return 0;
> +}
> +
> +static inline
> +void remove_data_temperature_info(struct inode *inode)
> +{
> +	return;
> +}
> +
> +static inline
> +struct data_temperature *get_data_temperature_info(const struct inode *inode)
> +{
> +	return ERR_PTR(-EOPNOTSUPP);
> +}
> +
> +static inline
> +int get_data_temperature(const struct inode *inode)
> +{
> +	return 0;
> +}
> +
> +static inline
> +int increase_data_temperature_by_dirty_folio(struct folio *folio)
> +{
> +	return 0;
> +}
> +
> +static inline
> +int account_flushed_folio_by_data_temperature(struct folio *folio)
> +{
> +	return 0;
> +}
> +
> +#endif	/* CONFIG_DATA_TEMPERATURE */
> +
> +#endif	/* _LINUX_DATA_TEMPERATURE_H */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a4af70367f8a..57c4810a28a0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -753,6 +753,10 @@ struct inode {
>  	struct fsverity_info	*i_verity_info;
>  #endif
>  
> +#ifdef CONFIG_DATA_TEMPERATURE
> +	struct data_temperature		*i_data_temperature_info;
> +#endif
> +
>  	void			*i_private; /* fs or device private pointer */
>  } __randomize_layout;
>  
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d9861e42b2bd..5de458b7fefc 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -38,6 +38,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/sched/signal.h>
>  #include <linux/mm_inline.h>
> +#include <linux/data_temperature.h>
>  #include <trace/events/writeback.h>
>  
>  #include "internal.h"
> @@ -2775,6 +2776,10 @@ static void folio_account_dirtied(struct folio *folio,
>  		__this_cpu_add(bdp_ratelimits, nr);
>  
>  		mem_cgroup_track_foreign_dirty(folio, wb);
> +
> +#ifdef CONFIG_DATA_TEMPERATURE
> +		increase_data_temperature_by_dirty_folio(folio);
> +#endif	/* CONFIG_DATA_TEMPERATURE */
>  	}
>  }
>  
> @@ -3006,6 +3011,10 @@ bool folio_clear_dirty_for_io(struct folio *folio)
>  
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>  
> +#ifdef CONFIG_DATA_TEMPERATURE
> +	account_flushed_folio_by_data_temperature(folio);
> +#endif	/* CONFIG_DATA_TEMPERATURE */
> +
>  	if (mapping && mapping_can_writeback(mapping)) {
>  		struct inode *inode = mapping->host;
>  		struct bdi_writeback *wb;
Hans Holmberg Jan. 27, 2025, 2:19 p.m. UTC | #4
On Fri, Jan 24, 2025 at 10:03 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Fri, 2025-01-24 at 08:19 +0000, Johannes Thumshirn wrote:
> > On 23.01.25 21:30, Viacheslav Dubeyko wrote:
> > > > [...]
> >
> > I don't want to pour gasoline on old flame wars, but what is the
> > advantage of this auto-magic data temperature framework vs the existing
> > framework?
> >
>
> There is no magic in this framework. :) It's a simple and compact framework.
>
> >  'enum rw_hint' has temperature in the range of none, short,
> > medium, long and extreme (what ever that means), can be set by an
> > application via an fcntl() and is plumbed down all the way to the bio
> > level by most FSes that care.
>
> I see your point. But the 'enum rw_hint' defines qualitative grades again:
>
> enum rw_hint {
>         WRITE_LIFE_NOT_SET      = RWH_WRITE_LIFE_NOT_SET,
>         WRITE_LIFE_NONE         = RWH_WRITE_LIFE_NONE,
>         WRITE_LIFE_SHORT        = RWH_WRITE_LIFE_SHORT,  <-- HOT data
>         WRITE_LIFE_MEDIUM       = RWH_WRITE_LIFE_MEDIUM, <-- WARM data
>         WRITE_LIFE_LONG         = RWH_WRITE_LIFE_LONG,   <-- COLD data
>         WRITE_LIFE_EXTREME      = RWH_WRITE_LIFE_EXTREME,
> } __packed;
>
> First of all, again, it's hard to compare the hotness of different files
> on such a qualitative basis. Secondly, who decides how hot a particular
> piece of data is? People can only guess or assume the nature of data based on
> past experience. But workloads are changing and evolving
> continuously and in real time. Technically speaking, an application can
> try to estimate the hotness of data, but, again, a file system can receive
> requests from multiple threads and multiple applications. So, an application
> can only guess about the real nature of data too. Moreover, nobody would like
> to implement dedicated logic in an application for data hotness estimation.
>
> This framework is inode based and it tries to estimate a file's
> "temperature" on a quantitative basis. Advantages of this framework:
> (1) we don't need to guess about data hotness, the temperature will be
> calculated quantitatively; (2) a quantitative basis gives the opportunity
> for a fair comparison of different files' temperatures; (3) a file's temperature
> will change as the workload(s) change in real time; (4) a file's
> temperature will be correctly accounted under load from multiple
> applications. I believe these are the advantages of the suggested framework.
>

While I think the general idea (using file overwrite rates as a
parameter when doing data placement) could be useful, it could not
replace the user space hinting we already have.

Applications (e.g. RocksDB) doing sequential writes to files that are
immutable until deleted (no overwrites) would not benefit. We need user
space help to estimate data lifetime for those workloads and the
relative write lifetime hints are useful for that.
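
As an illustration of the existing interface, a minimal user-space sketch of
setting a relative write lifetime hint with fcntl(F_SET_RW_HINT) could look
like the code below. The helper name set_write_life_hint() is hypothetical, and
the fallback #defines are only there for older libc headers (the values follow
include/uapi/linux/fcntl.h).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* glibc exposes F_SET_RW_HINT under _GNU_SOURCE */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>

/* Fallbacks for older headers; values match include/uapi/linux/fcntl.h. */
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		1036	/* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2
#endif

/*
 * Hypothetical helper: attach a write lifetime hint to the inode behind fd.
 * Returns 0 on success or a negative errno value on failure.
 */
int set_write_life_hint(int fd, uint64_t hint)
{
	/* F_SET_RW_HINT takes a pointer to a 64-bit hint value. */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		return -errno;

	return 0;
}
```

An application writing immutable files would call set_write_life_hint() right
after open(), before the first write, with whichever RWH_WRITE_LIFE_* value
matches the expected lifetime of the file.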

So what I am asking myself is: if this framework is added, who would
benefit? Without any benchmark results it's a bit hard to tell :)

Also, is there a good reason for only supporting buffered I/O? Direct
I/O could benefit in the same way, right?

Thanks,
Hans
Viacheslav Dubeyko Jan. 27, 2025, 8:12 p.m. UTC | #5
On Sat, 2025-01-25 at 07:25 -0500, Jeff Layton wrote:
> On Thu, 2025-01-23 at 12:24 -0800, Viacheslav Dubeyko wrote:
> > [...]
> > 
> > Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
> > ---
> >  fs/Kconfig                             |   2 +
> >  fs/Makefile                            |   1 +
> >  fs/data-temperature/Kconfig            |  11 +
> >  fs/data-temperature/Makefile           |   3 +
> >  fs/data-temperature/data_temperature.c | 347 +++++++++++++++++++++++++
> >  include/linux/data_temperature.h       | 124 +++++++++
> >  include/linux/fs.h                     |   4 +
> >  mm/page-writeback.c                    |   9 +
> >  8 files changed, 501 insertions(+)
> >  create mode 100644 fs/data-temperature/Kconfig
> >  create mode 100644 fs/data-temperature/Makefile
> >  create mode 100644 fs/data-temperature/data_temperature.c
> >  create mode 100644 include/linux/data_temperature.h
> > 
> 
> 
> This seems like an interesting idea, but how do you intend to use the
> temperature?
> 

Yes, this is not a complete implementation. A complete implementation requires
modifying particular file system(s). I am simply sharing the initial
vision.

Potentially, different file systems can use the temperature in different ways. The
simplest approach is to provide the temperature as a hint to the block layer, and
this hint value can be used by an FDP SSD, for example. But a file system itself can
also use the temperature value to elaborate its data placement policy. If a file
system uses the segment concept, then different types of segments can store data
with different temperatures. Usually, it is easy to store different types of
metadata in different segments. However, even different types of metadata could be
grouped on a temperature basis. But a proper placement policy for user data is
always a hard point for a file system. So, the temperature basis provides a way to
introduce a set of segments that can receive user data with different temperatures.
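
To make the "temperature as a hint" idea concrete, here is a minimal sketch of
how a file system might bucket the integer temperature into a few placement
classes before choosing a segment or a block layer hint. The enum, the
thresholds and the function name are all assumptions for illustration; the
patch itself only provides get_data_temperature().

```c
/* Hypothetical placement classes; not part of the patch. */
enum placement_class {
	PLACEMENT_COLD,
	PLACEMENT_WARM,
	PLACEMENT_HOT,
};

/*
 * Assumed thresholds, in degrees. In the patch, one degree corresponds to
 * updating as many blocks as the file contains.
 */
#define WARM_MIN_DEGREES	1
#define HOT_MIN_DEGREES		4

/* Map a temperature (as returned by get_data_temperature()) to a class. */
static inline enum placement_class temperature_to_class(int degrees)
{
	if (degrees >= HOT_MIN_DEGREES)
		return PLACEMENT_HOT;
	if (degrees >= WARM_MIN_DEGREES)
		return PLACEMENT_WARM;
	return PLACEMENT_COLD;
}
```

A file system could then pick the current segment (or a write hint) by class
instead of by raw temperature. The right number of classes and the thresholds
would have to come from benchmarking.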

But even among file systems that don't use the segment concept, multiple file
systems use the Allocation Groups concept. And, potentially, files with
different temperatures can be stored or grouped into different Allocation
Groups.

I believe that, potentially, the GC subsystem of LFS file systems can use the
temperature to elaborate a more efficient policy, because it is clear that file
content with a high temperature doesn't need to be processed by GC. I don't have
a clear algorithm for this policy in mind, but hot segments can be cleaned without
GC intervention, for example.

Also, an interesting point is that this approach tries to decrease the temperature
if the number of updates is decreasing. It means that a COW policy can store a
file's content in segments of different temperature on every update, following the
temperature as it changes over time. As a result, different portions of a big file
can be distributed among multiple segments. But a big file is always distributed
among multiple segments anyway.
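
The cooling rule can be modelled in plain C outside the kernel. The sketch
below mirrors the arithmetic of __decrease_data_temperature() from the patch.
The function names cooling_degrees() and cool_down() are made up for this
example, and the locking and state handling of the real code are left out.

```c
#include <stdint.h>

/*
 * Number of degrees to subtract after idle_ns nanoseconds without updates.
 *
 * time_range_ns:  duration of the last update operation
 * updated_blocks: blocks flushed during that operation
 * idle_ns:        time since the last update operation finished
 * file_blocks:    file size in blocks (the size of one degree)
 */
uint64_t cooling_degrees(uint64_t time_range_ns, uint64_t updated_blocks,
			 uint64_t idle_ns, uint64_t file_blocks)
{
	uint64_t ns_per_block;

	if (time_range_ns == 0 || updated_blocks == 0 || file_blocks == 0)
		return 0;

	/* Average time it took to update one block, at least 1 ns. */
	ns_per_block = time_range_ns / updated_blocks;
	if (ns_per_block == 0)
		ns_per_block = 1;

	/* Blocks "worth" of idle time, converted to whole degrees. */
	return (idle_ns / ns_per_block) / file_blocks;
}

/* Apply the cooling to a non-negative temperature, clamping at zero. */
int cool_down(int temperature, uint64_t degrees)
{
	if (temperature <= 0 || degrees >= (uint64_t)temperature)
		return 0;

	return temperature - (int)degrees;
}
```

For example, if 10 blocks were updated over 1000 ns and the file has 5 blocks,
one degree is lost per 500 ns of idle time, so 2000 ns without updates costs
4 degrees.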

> With this patch, it looks like you're just calculating it, but there is
> nothing that uses it and there is no way to access the temperature from
> userland. It would be nice to see this value used by an existing
> subsystem to drive data placement so we can see how it will help
> things.
> 
> > 

I did benchmarking by using the SSDFS file system (but any other file system can be
used for benchmarking too). And I am going to introduce several current segments
for user data with the goal of distributing user data by temperature.
Also, as I mentioned, these current segments can be stored by providing hints to
an FDP SSD through block layer logic. And I shared above the potential ways in
which various file systems can employ the calculated temperature.

Related to userland... I didn't consider sharing the temperature with user-space
subsystems. But it is a great point. Potentially, it is easy to
introduce an ioctl that can retrieve the temperature of a particular file. Or
maybe sysfs can be used to expose the distribution of data among temperature
groups/ranges. And an application can use this data to elaborate its data
placement policy. Let me think about it more.

Thanks,
Slava.
Viacheslav Dubeyko Jan. 27, 2025, 8:58 p.m. UTC | #6
On Mon, 2025-01-27 at 15:19 +0100, Hans Holmberg wrote:
> On Fri, Jan 24, 2025 at 10:03 PM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> > 
> > On Fri, 2025-01-24 at 08:19 +0000, Johannes Thumshirn wrote:
> > > On 23.01.25 21:30, Viacheslav Dubeyko wrote:
> > > > > [...]
> > > 
> > > [...]
> > 
> 
> While I think the general idea(using file-overwrite-rates as a
> parameter when doing data placement) could be useful, it could not
> replace the user space hinting we already have.
> 
> Applications(e.g. RocksDB) doing sequential writes to files that are
> immutable until deleted(no overwrites) would not benefit. We need user
> space help to estimate data lifetime for those workloads and the
> relative write lifetime hints are useful for that.
> 

I don't see any competition or conflict here. The suggested approach and user-space
hinting could be complementary techniques. If user-space logic wants to use
a special data placement policy, then it can share hints in its own way. But,
potentially, the suggested approach of temperature calculation can be used to check
the effectiveness of the user-space hinting and, maybe, to correct it. So, I
don't see any conflict here.

> So what I am asking myself is if this framework is added, who would
> benefit? Without any benchmark results it's a bit hard to tell :)
> 

Which benefits would you like to see? I assume we would like to: (1) prolong device
lifetime, (2) improve performance, (3) decrease GC burden. Do you mean these
benefits?

As far as I can see, different file systems can use temperature in different
ways. And this slightly complicates the benchmarking. So, how can we define
the effectiveness here and how can we measure it? Do you have a vision here? I
am happy to do more benchmarking.

My point is that the calculated file's temperature gives a quantitative way to
distribute even user data among several temperature groups ("baskets"). And
these baskets/segments/anything-else give a way to properly group data. File
systems can employ the temperature in various ways, but it can definitely help
to elaborate a proper data placement policy. As a result, GC burden can be
decreased, performance can be improved, and device lifetime can be prolonged. So,
how can we benchmark these points? And which approaches make sense to compare?
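
To make the "baskets" idea concrete, here is a minimal user-space sketch of
how a file system might sort files into temperature groups. It is only an
illustration: the bucket names, the thresholds, and classify_temperature()
are hypothetical and are not part of the patch. It assumes the temperature
is an integer number of degrees, where one degree stands for one full
rewrite of the file.

```c
/* Hypothetical temperature groups; the names are illustrative only. */
enum temp_bucket {
	BUCKET_COLD,
	BUCKET_WARM,
	BUCKET_HOT,
};

/*
 * Assumed thresholds in degrees (one degree = the whole file was
 * rewritten once). A real file system would tune these itself.
 */
#define WARM_THRESHOLD	1
#define HOT_THRESHOLD	4

/* Map a quantitative temperature to one of the groups above. */
enum temp_bucket classify_temperature(int temperature)
{
	if (temperature >= HOT_THRESHOLD)
		return BUCKET_HOT;
	if (temperature >= WARM_THRESHOLD)
		return BUCKET_WARM;
	return BUCKET_COLD;
}
```

Because the input is a number rather than a fixed grade, every file system
is free to choose its own thresholds and its own number of groups.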

> Also, is there a good reason for only supporting buffered io? Direct
> IO could benefit in the same way, right?
> 

I think that Direct IO could benefit too. The question here is how to account dirty
memory pages and updated memory pages. Currently, I am using
folio_account_dirtied() and folio_clear_dirty_for_io() to implement the
calculation of the temperature. As far as I can see, Direct IO requires other
methods of doing this. The rest of the logic can be the same.
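
For reference, the part of the logic that does not depend on the page cache
is the arithmetic from the patch description. Below is a minimal user-space
sketch of it, not the code from the patch. The function names, the
nanosecond units, and the clamping of very short updates to 1 ns per block
are assumptions made only for this illustration, and overflow of the
multiplication is not handled.

```c
#include <stdint.h>

/*
 * Growth: one degree is gained for every "whole file" worth of
 * updated logical blocks.
 */
uint64_t degrees_gained(uint64_t updated_blocks, uint64_t total_blocks)
{
	if (total_blocks == 0)
		return 0;

	return updated_blocks / total_blocks;
}

/*
 * Decay: the duration of the last update divided by the number of
 * updated blocks gives the time per block. Multiplied by the total
 * number of blocks in the file, it gives the time needed to cool
 * down by one degree. The idle time between the end of the last
 * update and the start of the new one, divided by that value, is
 * the number of degrees to subtract.
 */
uint64_t degrees_lost(uint64_t update_start_ns, uint64_t update_end_ns,
		      uint64_t updated_blocks, uint64_t total_blocks,
		      uint64_t next_update_start_ns)
{
	uint64_t ns_per_block, ns_per_degree, idle_ns;

	if (updated_blocks == 0 || total_blocks == 0 ||
	    update_end_ns <= update_start_ns ||
	    next_update_start_ns <= update_end_ns)
		return 0;

	ns_per_block = (update_end_ns - update_start_ns) / updated_blocks;
	if (ns_per_block == 0)
		ns_per_block = 1;	/* assumption: avoid division by zero */

	ns_per_degree = ns_per_block * total_blocks;
	idle_ns = next_update_start_ns - update_end_ns;

	return idle_ns / ns_per_degree;
}
```

For example, if 10 blocks of a 100-block file were updated in 1000 ns, then
one degree cools down in 10000 ns, and an idle period of 30000 ns subtracts
3 degrees. A Direct IO path would only need to supply the same inputs
(updated blocks and timestamps) from a different place.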

Thanks,
Slava.
Hans Holmberg Jan. 28, 2025, 8:45 a.m. UTC | #7
On Mon, Jan 27, 2025 at 9:59 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Mon, 2025-01-27 at 15:19 +0100, Hans Holmberg wrote:
> > On Fri, Jan 24, 2025 at 10:03 PM Viacheslav Dubeyko
> > <Slava.Dubeyko@ibm.com> wrote:
> > >
> > > On Fri, 2025-01-24 at 08:19 +0000, Johannes Thumshirn wrote:
> > > > On 23.01.25 21:30, Viacheslav Dubeyko wrote:
> > > > > [PROBLEM DECLARATION]
> > > > > Efficient data placement policy is a Holy Grail for data
> > > > > storage and file system engineers. Achieving this goal is
> > > > > equally important and really hard. Multiple data storage
> > > > > and file system technologies have been invented to manage
> > > > > the data placement policy (for example, COW, ZNS, FDP, etc).
> > > > > But these technologies still require the hints related to
> > > > > nature of data from application side.
> > > > >
> > > > > [DATA "TEMPERATURE" CONCEPT]
> > > > > One of the widely used and intuitively clear idea of data
> > > > > nature definition is data "temperature" (cold, warm,
> > > > > hot data). However, data "temperature" is as intuitively
> > > > > sound as illusive definition of data nature. Generally
> > > > > speaking, thermodynamics defines temperature as a way
> > > > > to estimate the average kinetic energy of vibrating
> > > > > atoms in a substance. But we cannot see a direct analogy
> > > > > between data "temperature" and temperature in physics
> > > > > because data is not something that has kinetic energy.
> > > > >
> > > > > [WHAT IS GENERALIZED DATA "TEMPERATURE" ESTIMATION]
> > > > > We usually imply that if some data is updated more
> > > > > frequently, then such data is more hot than other one.
> > > > > But, it is possible to see several problems here:
> > > > > (1) How can we estimate the data "hotness" in
> > > > > quantitative way? (2) We can state that data is "hot"
> > > > > after some number of updates. It means that this
> > > > > definition implies state of the data in the past.
> > > > > Will this data continue to be "hot" in the future?
> > > > > Generally speaking, the crucial problem is how to define
> > > > > the data nature or data "temperature" in the future.
> > > > > Because, this knowledge is the fundamental basis for
> > > > > elaboration an efficient data placement policy.
> > > > > Generalized data "temperature" estimation framework
> > > > > suggests the way to define a future state of the data
> > > > > and the basis for quantitative measurement of data
> > > > > "temperature".
> > > > >
> > > > > [ARCHITECTURE OF FRAMEWORK]
> > > > > Usually, file system has a page cache for every inode. And
> > > > > initially memory pages become dirty in page cache. Finally,
> > > > > dirty pages will be sent to storage device. Technically
> > > > > speaking, the number of dirty pages in a particular page
> > > > > cache is the quantitative measurement of current "hotness"
> > > > > of a file. But number of dirty pages is still not stable
> > > > > basis for quantitative measurement of data "temperature".
> > > > > It is possible to suggest of using the total number of
> > > > > logical blocks in a file as a unit of one degree of data
> > > > > "temperature". As a result, if the whole file was updated
> > > > > several times, then "temperature" of the file has been
> > > > > increased for several degrees. And if the file is under
> > > > > continous updates, then the file "temperature" is growing.
> > > > >
> > > > > We need to keep not only current number of dirty pages,
> > > > > but also the number of updated pages in the near past
> > > > > for accumulating the total "temperature" of a file.
> > > > > Generally speaking, total number of updated pages in the
> > > > > nearest past defines the aggregated "temperature" of file.
> > > > > And number of dirty pages defines the delta of
> > > > > "temperature" growth for current update operation.
> > > > > This approach defines the mechanism of "temperature" growth.
> > > > >
> > > > > But if we have no more updates for the file, then
> > > > > "temperature" needs to decrease. Starting and ending
> > > > > timestamps of update operation can work as a basis for
> > > > > decreasing "temperature" of a file. If we know the number
> > > > > of updated logical blocks of the file, then we can divide
> > > > > the duration of update operation on number of updated
> > > > > logical blocks. As a result, this is the way to define
> > > > > a time duration per one logical block. By means of
> > > > > multiplying this value (time duration per one logical
> > > > > block) on total number of logical blocks in file, we
> > > > > can calculate the time duration of "temperature"
> > > > > decreasing for one degree. Finally, the operation of
> > > > > division the time range (between end of last update
> > > > > operation and begin of new update operation) on
> > > > > the time duration of "temperature" decreasing for
> > > > > one degree provides the way to define how many
> > > > > degrees should be subtracted from current "temperature"
> > > > > of the file.
> > > > >
> > > > > [HOW TO USE THE APPROACH]
> > > > > The lifetime of data "temperature" value for a file
> > > > > can be explained by steps: (1) iget() method sets
> > > > > the data "temperature" object; (2) folio_account_dirtied()
> > > > > method accounts the number of dirty memory pages and
> > > > > tries to estimate the current temperature of the file;
> > > > > (3) folio_clear_dirty_for_io() decrease number of dirty
> > > > > memory pages and increases number of updated pages;
> > > > > (4) folio_account_dirtied() also decreases file's
> > > > > "temperature" if updates hasn't happened some time;
> > > > > (5) file system can get file's temperature and
> > > > > to share the hint with block layer; (6) inode
> > > > > eviction method removes and free the data "temperature"
> > > > > object.
> > > >
> > > > I don't want to pour gasoline on old flame wars, but what is the
> > > > advantage of this auto-magic data temperature framework vs the existing
> > > > framework?
> > > >
> > >
> > > There is no magic in this framework. :) It's simple and compact framework.
> > >
> > > >  'enum rw_hint' has temperature in the range of none, short,
> > > > medium, long and extreme (what ever that means), can be set by an
> > > > application via an fcntl() and is plumbed down all the way to the bio
> > > > level by most FSes that care.
> > >
> > > I see your point. But the 'enum rw_hint' defines qualitative grades again:
> > >
> > > enum rw_hint {
> > >         WRITE_LIFE_NOT_SET      = RWH_WRITE_LIFE_NOT_SET,
> > >         WRITE_LIFE_NONE         = RWH_WRITE_LIFE_NONE,
> > >         WRITE_LIFE_SHORT        = RWH_WRITE_LIFE_SHORT,  <-- HOT data
> > >         WRITE_LIFE_MEDIUM       = RWH_WRITE_LIFE_MEDIUM, <-- WARM data
> > >         WRITE_LIFE_LONG         = RWH_WRITE_LIFE_LONG,   <-- COLD data
> > >         WRITE_LIFE_EXTREME      = RWH_WRITE_LIFE_EXTREME,
> > > } __packed;
> > >
> > > First of all, again, it's hard to compare the hotness of different files
> > > on such qualitative basis. Secondly, who decides what is hotness of a particular
> > > data? People can only guess or assume the nature of data based on
> > > experience in the past. But workloads are changing and evolving
> > > continuously and in real-time manner. Technically speaking, application can
> > > try to estimate the hotness of data, but, again, file system can receive
> > > requests from multiple threads and multiple applications. So, application
> > > can guess about real nature of data too. Especially, nobody would like
> > > to implement dedicated logic in application for data hotness estimation.
> > >
> > > This framework is inode based and it tries to estimate file's
> > > "temperature" on quantitative basis. Advantages of this framework:
> > > (1) we don't need to guess about data hotness, temperature will be
> > > calculated quantitatively; (2) quantitative basis gives opportunity
> > > for fair comparison of different files' temperature; (3) file's temperature
> > > will change with workload(s) changing in real-time; (4) file's
> > > temperature will be correctly accounted under the load from multiple
> > > applications. I believe these are advantages of the suggested framework.
> > >
> >
> > While I think the general idea(using file-overwrite-rates as a
> > parameter when doing data placement) could be useful, it could not
> > replace the user space hinting we already have.
> >
> > Applications(e.g. RocksDB) doing sequential writes to files that are
> > immutable until deleted(no overwrites) would not benefit. We need user
> > space help to estimate data lifetime for those workloads and the
> > relative write lifetime hints are useful for that.
> >
>
> I don't see any competition or conflict here. Suggested approach and user-space
> hinting could be complementary techniques. If user-space logic would like to use
> a special data placement policy, then it can share hints in its own way. But,
> potentially, suggested approach of temperature calculation can be used to check
> the effectiveness of the user-space hinting, and, maybe, correcting it. So, I
> don't see any conflict here.

I don't see a conflict here either. My point is just that this
framework cannot replace the user hints.

>
> > So what I am asking myself is if this framework is added, who would
> > benefit? Without any benchmark results it's a bit hard to tell :)
> >
>
> Which benefits would you like to see? I assume we would like: (1) prolong device
> lifetime, (2) improve performance, (3) decrease GC burden. Do you mean these
> benefits?

Yep, decreased write amplification essentially.

>
> As far as I can see, different file systems can use temperature in different
> way. And this is slightly complicates the benchmarking. So, how can we define
> the effectiveness here and how can we measure it? Do you have a vision here? I
> am happy to make more benchmarking.
>
> My point is that the calculated file's temperature gives the quantitative way to
> distribute even user data among several temperature groups ("baskets"). And
> these baskets/segments/anything-else gives the way to properly group data. File
> systems can employ the temperature in various ways, but it can definitely helps
> to elaborate proper data placement policy. As a result, GC burden can be
> decreased, performance can be improved, and lifetime device can be prolong. So,
> how can we benchmark these points? And which approaches make sense to compare?
>

To start off, it would be nice to demonstrate that write amplification
decreases for some workload when the temperature is taken into
account. It would be great if the workload would be an actual
application workload or a synthetic one mimicking some real-world-like
use case.
Run the same workload twice, measure write amplification and compare results.

What user workloads do you see benefiting from this framework? Which would not?

> > Also, is there a good reason for only supporting buffered io? Direct
> > IO could benefit in the same way, right?
> >
>
> I think that Direct IO could benefit too. The question here how to account dirty
> memory pages and updated memory pages. Currently, I am using
> folio_account_dirtied() and folio_clear_dirty_for_io() to implement the
> calculation the temperature. As far as I can see, Direct IO requires another
> methods of doing this. The rest logic can be the same.

It's probably a good idea to cover direct IO as well then as this is
intended to be a generalized framework.
Johannes Thumshirn Jan. 28, 2025, 8:59 a.m. UTC | #8
On 28.01.25 09:45, Hans Holmberg wrote:
>> I think that Direct IO could benefit too. The question here how to account dirty
>> memory pages and updated memory pages. Currently, I am using
>> folio_account_dirtied() and folio_clear_dirty_for_io() to implement the
>> calculation the temperature. As far as I can see, Direct IO requires another
>> methods of doing this. The rest logic can be the same.
> 
> It's probably a good idea to cover direct IO as well then as this is
> intended to be a generalized framework.

Especially given that most applications that really care about data
lifetimes, write amplification, etc. are heavy users of direct I/O.
Viacheslav Dubeyko Jan. 28, 2025, 10:30 p.m. UTC | #9
On Tue, 2025-01-28 at 09:45 +0100, Hans Holmberg wrote:
> On Mon, Jan 27, 2025 at 9:59 PM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> > 
> > On Mon, 2025-01-27 at 15:19 +0100, Hans Holmberg wrote:
> > > On Fri, Jan 24, 2025 at 10:03 PM Viacheslav Dubeyko
> > > <Slava.Dubeyko@ibm.com> wrote:
> > > > 
> > > > 

<skipped>

> > > > > > 
> > > > > > [HOW TO USE THE APPROACH]
> > > > > > The lifetime of data "temperature" value for a file
> > > > > > can be explained by steps: (1) iget() method sets
> > > > > > the data "temperature" object; (2) folio_account_dirtied()
> > > > > > method accounts the number of dirty memory pages and
> > > > > > tries to estimate the current temperature of the file;
> > > > > > (3) folio_clear_dirty_for_io() decrease number of dirty
> > > > > > memory pages and increases number of updated pages;
> > > > > > (4) folio_account_dirtied() also decreases file's
> > > > > > "temperature" if updates hasn't happened some time;
> > > > > > (5) file system can get file's temperature and
> > > > > > to share the hint with block layer; (6) inode
> > > > > > eviction method removes and free the data "temperature"
> > > > > > object.
> > > > > 
> > > > > I don't want to pour gasoline on old flame wars, but what is the
> > > > > advantage of this auto-magic data temperature framework vs the existing
> > > > > framework?
> > > > > 
> > > > 
> > > > There is no magic in this framework. :) It's simple and compact framework.
> > > > 
> > > > >  'enum rw_hint' has temperature in the range of none, short,
> > > > > medium, long and extreme (what ever that means), can be set by an
> > > > > application via an fcntl() and is plumbed down all the way to the bio
> > > > > level by most FSes that care.
> > > > 
> > > > I see your point. But the 'enum rw_hint' defines qualitative grades again:
> > > > 
> > > > enum rw_hint {
> > > >         WRITE_LIFE_NOT_SET      = RWH_WRITE_LIFE_NOT_SET,
> > > >         WRITE_LIFE_NONE         = RWH_WRITE_LIFE_NONE,
> > > >         WRITE_LIFE_SHORT        = RWH_WRITE_LIFE_SHORT,  <-- HOT data
> > > >         WRITE_LIFE_MEDIUM       = RWH_WRITE_LIFE_MEDIUM, <-- WARM data
> > > >         WRITE_LIFE_LONG         = RWH_WRITE_LIFE_LONG,   <-- COLD data
> > > >         WRITE_LIFE_EXTREME      = RWH_WRITE_LIFE_EXTREME,
> > > > } __packed;
> > > > 
> > > > First of all, again, it's hard to compare the hotness of different files
> > > > on such qualitative basis. Secondly, who decides what is hotness of a particular
> > > > data? People can only guess or assume the nature of data based on
> > > > experience in the past. But workloads are changing and evolving
> > > > continuously and in real-time manner. Technically speaking, application can
> > > > try to estimate the hotness of data, but, again, file system can receive
> > > > requests from multiple threads and multiple applications. So, application
> > > > can guess about real nature of data too. Especially, nobody would like
> > > > to implement dedicated logic in application for data hotness estimation.
> > > > 
> > > > This framework is inode based and it tries to estimate file's
> > > > "temperature" on quantitative basis. Advantages of this framework:
> > > > (1) we don't need to guess about data hotness, temperature will be
> > > > calculated quantitatively; (2) quantitative basis gives opportunity
> > > > for fair comparison of different files' temperature; (3) file's temperature
> > > > will change with workload(s) changing in real-time; (4) file's
> > > > temperature will be correctly accounted under the load from multiple
> > > > applications. I believe these are advantages of the suggested framework.
> > > > 
> > > 
> > > While I think the general idea(using file-overwrite-rates as a
> > > parameter when doing data placement) could be useful, it could not
> > > replace the user space hinting we already have.
> > > 
> > > Applications(e.g. RocksDB) doing sequential writes to files that are
> > > immutable until deleted(no overwrites) would not benefit. We need user
> > > space help to estimate data lifetime for those workloads and the
> > > relative write lifetime hints are useful for that.
> > > 
> > 
> > I don't see any competition or conflict here. Suggested approach and user-space
> > hinting could be complementary techniques. If user-space logic would like to use
> > a special data placement policy, then it can share hints in its own way. But,
> > potentially, suggested approach of temperature calculation can be used to check
> > the effectiveness of the user-space hinting, and, maybe, correcting it. So, I
> > don't see any conflict here.
> 
> I don't see a conflict here either, my point is just that this
> framework cannot replace the user hints.
> 

I have no intention of replacing any existing techniques. :)

> > 
> > > So what I am asking myself is if this framework is added, who would
> > > benefit? Without any benchmark results it's a bit hard to tell :)
> > > 
> > 
> > Which benefits would you like to see? I assume we would like: (1) prolong device
> > lifetime, (2) improve performance, (3) decrease GC burden. Do you mean these
> > benefits?
> 
> Yep, decreased write amplification essentially.
> 

The important point here is that the suggested framework offers only a means to
estimate temperature. But only a file system technique can decrease or increase
write amplification. So, we need to compare apples with apples. As far as I
know, F2FS has an algorithm for estimating and employing temperature. Do you mean
F2FS, or how do you see the way of estimating the decrease in write amplification?
Because every file system should have its own way to employ temperature.

> > 
> > As far as I can see, different file systems can use temperature in different
> > way. And this is slightly complicates the benchmarking. So, how can we define
> > the effectiveness here and how can we measure it? Do you have a vision here? I
> > am happy to make more benchmarking.
> > 
> > My point is that the calculated file's temperature gives the quantitative way to
> > distribute even user data among several temperature groups ("baskets"). And
> > these baskets/segments/anything-else gives the way to properly group data. File
> > systems can employ the temperature in various ways, but it can definitely helps
> > to elaborate proper data placement policy. As a result, GC burden can be
> > decreased, performance can be improved, and lifetime device can be prolong. So,
> > how can we benchmark these points? And which approaches make sense to compare?
> > 
> 
> To start off, it would be nice to demonstrate that write amplification
> decreases for some workload when the temperature is taken into
> account. It would be great if the workload would be an actual
> application workload or a synthetic one mimicking some real-world-like
> use case.
> Run the same workload twice, measure write amplification and compare results.
> 

Another trouble here. What is the way to measure write amplification, from your
point of view? Which benchmarking tool or framework do you suggest for write
amplification estimation?

> What user workloads do you see benefiting from this framework? Which would not?
> 

We need to talk first about the file system mechanism to employ data temperature
in an efficient way. Because there is no universal way to employ data temperature
and different file systems can implement completely different techniques. And
only then will it be possible to estimate which file system can provide
benefits for a particular workload. The suggested framework only estimates the
temperature.

> > > Also, is there a good reason for only supporting buffered io? Direct
> > > IO could benefit in the same way, right?
> > > 
> > 
> > I think that Direct IO could benefit too. The question here how to account dirty
> > memory pages and updated memory pages. Currently, I am using
> > folio_account_dirtied() and folio_clear_dirty_for_io() to implement the
> > calculation the temperature. As far as I can see, Direct IO requires another
> > methods of doing this. The rest logic can be the same.
> 
> It's probably a good idea to cover direct IO as well then as this is
> intended to be a generalized framework.

Covering Direct IO is a good point. But even a page cache based approach makes
sense because LFS and GC based file systems need to manage data in an efficient
way. By the way, do you have a vision of which methods can be used in the case of
Direct IO to account dirty and updated memory pages?

Thanks,
Slava.
Viacheslav Dubeyko Jan. 28, 2025, 10:35 p.m. UTC | #10
On Tue, 2025-01-28 at 08:59 +0000, Johannes Thumshirn wrote:
> On 28.01.25 09:45, Hans Holmberg wrote:
> > > I think that Direct IO could benefit too. The question here how to account dirty
> > > memory pages and updated memory pages. Currently, I am using
> > > folio_account_dirtied() and folio_clear_dirty_for_io() to implement the
> > > calculation the temperature. As far as I can see, Direct IO requires another
> > > methods of doing this. The rest logic can be the same.
> > 
> > It's probably a good idea to cover direct IO as well then as this is
> > intended to be a generalized framework.
> 
> Especially given that most applications that really care about data 
> lifetimes, write amplification etc are heavy users of direct I/O.

I believe smartphones are a really huge use-case. And LFS and GC based file
systems are used there. So, a page cache based approach makes sense for such
file systems to manage data placement policy efficiently.

I like this suggestion related to the Direct IO case. But we need to elaborate
a way to properly manage the dirty and updated memory pages calculation for
the Direct IO case.

Thanks,
Slava.
Hans Holmberg Jan. 29, 2025, 10:23 a.m. UTC | #11
On Tue, Jan 28, 2025 at 11:31 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Tue, 2025-01-28 at 09:45 +0100, Hans Holmberg wrote:
> > On Mon, Jan 27, 2025 at 9:59 PM Viacheslav Dubeyko
> > <Slava.Dubeyko@ibm.com> wrote:
> > >
> > > On Mon, 2025-01-27 at 15:19 +0100, Hans Holmberg wrote:
> > > > On Fri, Jan 24, 2025 at 10:03 PM Viacheslav Dubeyko
> > > > <Slava.Dubeyko@ibm.com> wrote:
> > > > >
> > > > >
> > >
> > > > So what I am asking myself is if this framework is added, who would
> > > > benefit? Without any benchmark results it's a bit hard to tell :)
> > > >
> > >
> > > Which benefits would you like to see? I assume we would like: (1) prolong device
> > > lifetime, (2) improve performance, (3) decrease GC burden. Do you mean these
> > > benefits?
> >
> > Yep, decreased write amplification essentially.
> >
>
> The important point here that the suggested framework offers only means to
> estimate temperature. But only file system technique can decrease or increase
> write amplification. So, we need to compare apples with apples. As far as I
> know, F2FS has algorithm of estimation and employing temperature. Do you imply
> F2FS or how do you see the way of estimation the write amplification decreasing?
> Because, every file system should have own way to employ temperature.

If you could show that this framework can decrease write amplification
in ssdfs, f2fs or any other file system, I think that would be a good
start.

Compare using your generated temperatures vs not using the temperature info.

>
> > >
> > > As far as I can see, different file systems can use temperature in different
> > > way. And this is slightly complicates the benchmarking. So, how can we define
> > > the effectiveness here and how can we measure it? Do you have a vision here? I
> > > am happy to make more benchmarking.
> > >
> > > My point is that the calculated file's temperature gives the quantitative way to
> > > distribute even user data among several temperature groups ("baskets"). And
> > > these baskets/segments/anything-else gives the way to properly group data. File
> > > systems can employ the temperature in various ways, but it can definitely helps
> > > to elaborate proper data placement policy. As a result, GC burden can be
> > > decreased, performance can be improved, and lifetime device can be prolong. So,
> > > how can we benchmark these points? And which approaches make sense to compare?
> > >
> >
> > To start off, it would be nice to demonstrate that write amplification
> > decreases for some workload when the temperature is taken into
> > account. It would be great if the workload would be an actual
> > application workload or a synthetic one mimicking some real-world-like
> > use case.
> > Run the same workload twice, measure write amplification and compare results.
> >
>
> Another trouble here. What is the way to measure write amplification, from your
> point of view? Which benchmarking tool or framework do you suggest for write
> amplification estimation?

FDP drives expose this information. You can retrieve the stats using
the nvme cli.
If you are using zoned storage, you can add write amp metrics inside
the file system or just measure the number of blocks written to the
device using iostat.
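
Whichever source the counters come from, the comparison itself is simple
arithmetic. A minimal sketch, assuming you have collected two byte counters
per run: the bytes the workload asked to write, and the bytes actually
written further down (device media for FDP, or device-level writes when the
file system does its own GC on zoned storage). The function names are made
up for this example.

```c
#include <stdint.h>

/*
 * Write amplification factor: bytes actually written divided by the
 * bytes the workload asked to write. Returns -1.0 when nothing was
 * written by the workload, since the ratio is undefined then.
 */
double write_amplification(uint64_t host_bytes, uint64_t media_bytes)
{
	if (host_bytes == 0)
		return -1.0;

	return (double)media_bytes / (double)host_bytes;
}

/*
 * Difference between the baseline run and the temperature-aware run.
 * A positive value means the temperature-aware run amplified less.
 */
double waf_improvement(double baseline_waf, double temperature_waf)
{
	return baseline_waf - temperature_waf;
}
```

Run the same workload once without and once with the temperature info,
compute the factor for each run, and compare the two numbers.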

> > > > Also, is there a good reason for only supporting buffered io? Direct
> > > > IO could benefit in the same way, right?
> > > >
> > >
> > > I think that Direct IO could benefit too. The question here how to account dirty
> > > memory pages and updated memory pages. Currently, I am using
> > > folio_account_dirtied() and folio_clear_dirty_for_io() to implement the
> > > calculation the temperature. As far as I can see, Direct IO requires another
> > > methods of doing this. The rest logic can be the same.
> >
> > It's probably a good idea to cover direct IO as well then as this is
> > intended to be a generalized framework.
>
> To cover Direct IO is a good point. But even page cache based approach makes
> sense because LFS and GC based file systems needs to manage data in efficient
> way. By the way, do you have a vision which methods can be used for the case of
> Direct IO to account dirty and updated memory pages?
>

Temperature feedback could instead be provided by file systems that
would actually care about using the information.

Patch

diff --git a/fs/Kconfig b/fs/Kconfig
index 64d420e3c475..ae117c2e3ce2 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -139,6 +139,8 @@  source "fs/autofs/Kconfig"
 source "fs/fuse/Kconfig"
 source "fs/overlayfs/Kconfig"
 
+source "fs/data-temperature/Kconfig"
+
 menu "Caches"
 
 source "fs/netfs/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 15df0a923d3a..c7e6ccac633d 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -129,3 +129,4 @@  obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
 obj-$(CONFIG_BPF_LSM)		+= bpf_fs_kfuncs.o
+obj-$(CONFIG_DATA_TEMPERATURE)	+= data-temperature/
diff --git a/fs/data-temperature/Kconfig b/fs/data-temperature/Kconfig
new file mode 100644
index 000000000000..1cade2741982
--- /dev/null
+++ b/fs/data-temperature/Kconfig
@@ -0,0 +1,11 @@ 
+# SPDX-License-Identifier: GPL-2.0
+
+config DATA_TEMPERATURE
+	bool "Data temperature approach for efficient data placement"
+	help
+	  Enable data "temperature" estimation for efficient data
+	  placement policy. This approach is file based and
+	  it estimates "temperature" for every file independently.
+	  The goal of the approach is to provide valuable hints
+	  to file system or/and SSD for isolation and proper
+	  management of data with different temperatures.
diff --git a/fs/data-temperature/Makefile b/fs/data-temperature/Makefile
new file mode 100644
index 000000000000..8e089a681360
--- /dev/null
+++ b/fs/data-temperature/Makefile
@@ -0,0 +1,3 @@ 
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_DATA_TEMPERATURE) += data_temperature.o
diff --git a/fs/data-temperature/data_temperature.c b/fs/data-temperature/data_temperature.c
new file mode 100644
index 000000000000..ea43fbfc3976
--- /dev/null
+++ b/fs/data-temperature/data_temperature.c
@@ -0,0 +1,347 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Data "temperature" paradigm implementation
+ *
+ * Copyright (c) 2024-2025 Viacheslav Dubeyko <slava@dubeyko.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pagemap.h>
+#include <linux/data_temperature.h>
+#include <linux/fs.h>
+
+#define TIME_IS_UNKNOWN		(U64_MAX)
+
+struct kmem_cache *data_temperature_info_cachep;
+
+static inline
+void create_data_temperature_info(struct data_temperature *dt_info)
+{
+	if (!dt_info)
+		return;
+
+	atomic_set(&dt_info->temperature, 0);
+	dt_info->updated_blocks = 0;
+	dt_info->dirty_blocks = 0;
+	dt_info->start_timestamp = TIME_IS_UNKNOWN;
+	dt_info->end_timestamp = TIME_IS_UNKNOWN;
+	dt_info->state = DATA_TEMPERATURE_CREATED;
+}
+
+static inline
+void free_data_temperature_info(struct data_temperature *dt_info)
+{
+	if (!dt_info)
+		return;
+
+	kmem_cache_free(data_temperature_info_cachep, dt_info);
+}
+
+int __set_data_temperature_info(struct inode *inode)
+{
+	struct data_temperature *dt_info;
+
+	dt_info = kmem_cache_zalloc(data_temperature_info_cachep, GFP_KERNEL);
+	if (!dt_info)
+		return -ENOMEM;
+
+	spin_lock_init(&dt_info->change_lock);
+	create_data_temperature_info(dt_info);
+
+	if (cmpxchg_release(&inode->i_data_temperature_info,
+					NULL, dt_info) != NULL) {
+		free_data_temperature_info(dt_info);
+		get_data_temperature_info(inode);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(__set_data_temperature_info);
+
+void __remove_data_temperature_info(struct inode *inode)
+{
+	free_data_temperature_info(inode->i_data_temperature_info);
+	inode->i_data_temperature_info = NULL;
+}
+EXPORT_SYMBOL_GPL(__remove_data_temperature_info);
+
+int __get_data_temperature(const struct inode *inode)
+{
+	struct data_temperature *dt_info;
+
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+
+	dt_info = get_data_temperature_info(inode);
+	if (IS_ERR_OR_NULL(dt_info))
+		return 0;
+
+	return atomic_read(&dt_info->temperature);
+}
+EXPORT_SYMBOL_GPL(__get_data_temperature);
+
+static inline
+bool is_timestamp_invalid(struct data_temperature *dt_info)
+{
+	if (!dt_info)
+		return false;
+
+	if (dt_info->start_timestamp == TIME_IS_UNKNOWN ||
+	    dt_info->end_timestamp == TIME_IS_UNKNOWN)
+		return true;
+
+	if (dt_info->start_timestamp > dt_info->end_timestamp)
+		return true;
+
+	return false;
+}
+
+static inline
+u64 get_current_timestamp(void)
+{
+	return ktime_get_boottime_ns();
+}
+
+static inline
+void start_account_data_temperature_info(struct data_temperature *dt_info)
+{
+	if (!dt_info)
+		return;
+
+	dt_info->dirty_blocks = 1;
+	dt_info->start_timestamp = get_current_timestamp();
+	dt_info->end_timestamp = TIME_IS_UNKNOWN;
+	dt_info->state = DATA_TEMPERATURE_UPDATE_STARTED;
+}
+
+static inline
+void __increase_data_temperature(struct inode *inode,
+				 struct data_temperature *dt_info)
+{
+	u64 bytes_count;
+	u64 file_blocks;
+	u32 block_bytes;
+	int dirty_blocks_ratio;
+	int updated_blocks_ratio;
+	int old_temperature;
+	int calculated;
+
+	if (!inode || !dt_info)
+		return;
+
+	block_bytes = 1 << inode->i_blkbits;
+	bytes_count = i_size_read(inode) + block_bytes - 1;
+	file_blocks = bytes_count >> inode->i_blkbits;
+
+	dt_info->dirty_blocks++;
+
+	if (file_blocks > 0) {
+		old_temperature = atomic_read(&dt_info->temperature);
+
+		dirty_blocks_ratio = div64_u64(dt_info->dirty_blocks,
+						file_blocks);
+		updated_blocks_ratio = div64_u64(dt_info->updated_blocks,
+						file_blocks);
+		calculated = max_t(int, dirty_blocks_ratio,
+					updated_blocks_ratio);
+
+		if (calculated > 0 && old_temperature < calculated)
+			atomic_set(&dt_info->temperature, calculated);
+	}
+}
+
+static inline
+void __decrease_data_temperature(struct inode *inode,
+				 struct data_temperature *dt_info)
+{
+	u64 timestamp;
+	u64 time_range;
+	u64 time_diff;
+	u64 bytes_count;
+	u64 file_blocks;
+	u32 block_bytes;
+	u64 blks_per_temperature_degree;
+	u64 ns_per_block;
+	u64 temperature_diff;
+
+	if (!inode || !dt_info)
+		return;
+
+	if (is_timestamp_invalid(dt_info)) {
+		create_data_temperature_info(dt_info);
+		return;
+	}
+
+	timestamp = get_current_timestamp();
+
+	if (dt_info->end_timestamp > timestamp) {
+		create_data_temperature_info(dt_info);
+		return;
+	}
+
+	time_range = dt_info->end_timestamp - dt_info->start_timestamp;
+	time_diff = timestamp - dt_info->end_timestamp;
+
+	block_bytes = 1 << inode->i_blkbits;
+	bytes_count = i_size_read(inode) + block_bytes - 1;
+	file_blocks = bytes_count >> inode->i_blkbits;
+
+	blks_per_temperature_degree = file_blocks;
+	if (blks_per_temperature_degree == 0) {
+		start_account_data_temperature_info(dt_info);
+		return;
+	}
+
+	if (dt_info->updated_blocks == 0 || time_range == 0) {
+		start_account_data_temperature_info(dt_info);
+		return;
+	}
+
+	ns_per_block = div64_u64(time_range, dt_info->updated_blocks);
+	if (ns_per_block == 0)
+		ns_per_block = 1;
+
+	if (time_diff == 0) {
+		start_account_data_temperature_info(dt_info);
+		return;
+	}
+
+	temperature_diff = div64_u64(time_diff, ns_per_block);
+	temperature_diff = div64_u64(temperature_diff,
+					blks_per_temperature_degree);
+
+	if (temperature_diff == 0)
+		return;
+
+	if (temperature_diff <= atomic_read(&dt_info->temperature)) {
+		atomic_sub(temperature_diff, &dt_info->temperature);
+		dt_info->updated_blocks -=
+			temperature_diff * blks_per_temperature_degree;
+	} else {
+		atomic_set(&dt_info->temperature, 0);
+		dt_info->updated_blocks = 0;
+	}
+}
+
+int __increase_data_temperature_by_dirty_folio(struct folio *folio)
+{
+	struct inode *inode;
+	struct data_temperature *dt_info;
+
+	if (!folio || !folio->mapping)
+		return 0;
+
+	inode = folio_inode(folio);
+
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+
+	dt_info = get_data_temperature_info(inode);
+	if (IS_ERR_OR_NULL(dt_info))
+		return 0;
+
+	spin_lock(&dt_info->change_lock);
+	switch (dt_info->state) {
+	case DATA_TEMPERATURE_CREATED:
+		atomic_set(&dt_info->temperature, 0);
+		start_account_data_temperature_info(dt_info);
+		break;
+
+	case DATA_TEMPERATURE_UPDATE_STARTED:
+		__increase_data_temperature(inode, dt_info);
+		break;
+
+	case DATA_TEMPERATURE_UPDATE_FINISHED:
+		__decrease_data_temperature(inode, dt_info);
+		start_account_data_temperature_info(dt_info);
+		break;
+
+	default:
+		/* do nothing */
+		break;
+	}
+	spin_unlock(&dt_info->change_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(__increase_data_temperature_by_dirty_folio);
+
+static inline
+void decrement_dirty_blocks(struct data_temperature *dt_info)
+{
+	if (!dt_info)
+		return;
+
+	if (dt_info->dirty_blocks > 0) {
+		dt_info->dirty_blocks--;
+		dt_info->updated_blocks++;
+	}
+}
+
+static inline
+void finish_increasing_data_temperature(struct data_temperature *dt_info)
+{
+	if (!dt_info)
+		return;
+
+	if (dt_info->dirty_blocks == 0) {
+		dt_info->end_timestamp = get_current_timestamp();
+		dt_info->state = DATA_TEMPERATURE_UPDATE_FINISHED;
+	}
+}
+
+int __account_flushed_folio_by_data_temperature(struct folio *folio)
+{
+	struct inode *inode;
+	struct data_temperature *dt_info;
+
+	if (!folio || !folio->mapping)
+		return 0;
+
+	inode = folio_inode(folio);
+
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+
+	dt_info = get_data_temperature_info(inode);
+	if (IS_ERR_OR_NULL(dt_info))
+		return 0;
+
+	spin_lock(&dt_info->change_lock);
+	switch (dt_info->state) {
+	case DATA_TEMPERATURE_CREATED:
+		create_data_temperature_info(dt_info);
+		break;
+
+	case DATA_TEMPERATURE_UPDATE_STARTED:
+		if (dt_info->dirty_blocks > 0)
+			decrement_dirty_blocks(dt_info);
+		if (dt_info->dirty_blocks == 0)
+			finish_increasing_data_temperature(dt_info);
+		break;
+
+	case DATA_TEMPERATURE_UPDATE_FINISHED:
+		/* do nothing */
+		break;
+
+	default:
+		/* do nothing */
+		break;
+	}
+	spin_unlock(&dt_info->change_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(__account_flushed_folio_by_data_temperature);
+
+static int __init data_temperature_init(void)
+{
+	data_temperature_info_cachep = KMEM_CACHE(data_temperature,
+						  SLAB_RECLAIM_ACCOUNT);
+	if (!data_temperature_info_cachep)
+		return -ENOMEM;
+
+	return 0;
+}
+late_initcall(data_temperature_init);
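The heating and cooling arithmetic in data_temperature.c is easier to follow with concrete numbers. What follows is only a user-space sketch of that arithmetic and is not part of the patch: struct dt_model, size_to_blocks(), model_heat() and model_cool() are hypothetical names, the timestamps are passed in as plain nanosecond counts, and the locking, atomics and state handling are left out.

```c
#include <stdint.h>

/* Hypothetical user-space mirror of the per-file counters in the patch. */
struct dt_model {
	int temperature;
	uint64_t dirty_blocks;
	uint64_t updated_blocks;
};

/* Round the file size up to whole blocks, as the patch does with i_blkbits. */
static uint64_t size_to_blocks(uint64_t size, unsigned int blkbits)
{
	return (size + (1ULL << blkbits) - 1) >> blkbits;
}

/*
 * Heating: count one more dirty block, then raise the temperature to
 * max(dirty_blocks, updated_blocks) / file_blocks if that is higher.
 * The temperature is never lowered here.
 */
static void model_heat(struct dt_model *m, uint64_t file_blocks)
{
	uint64_t dirty_ratio, updated_ratio, calculated;

	m->dirty_blocks++;
	if (file_blocks == 0)
		return;

	dirty_ratio = m->dirty_blocks / file_blocks;
	updated_ratio = m->updated_blocks / file_blocks;
	calculated = dirty_ratio > updated_ratio ? dirty_ratio : updated_ratio;

	if (calculated > 0 && (uint64_t)m->temperature < calculated)
		m->temperature = (int)calculated;
}

/*
 * Cooling: range_ns is the length of the last update window
 * (end_timestamp - start_timestamp), idle_ns is the time elapsed since
 * that window ended. One degree is lost for every file_blocks "block
 * times" of idleness, where a block time is range_ns / updated_blocks.
 */
static void model_cool(struct dt_model *m, uint64_t file_blocks,
		       uint64_t range_ns, uint64_t idle_ns)
{
	uint64_t ns_per_block, diff;

	if (file_blocks == 0 || m->updated_blocks == 0 ||
	    range_ns == 0 || idle_ns == 0)
		return;

	ns_per_block = range_ns / m->updated_blocks;
	if (ns_per_block == 0)
		ns_per_block = 1;

	diff = idle_ns / ns_per_block / file_blocks;
	if (diff == 0)
		return;

	if (diff <= (uint64_t)m->temperature) {
		m->temperature -= (int)diff;
		m->updated_blocks -= diff * file_blocks;
	} else {
		m->temperature = 0;
		m->updated_blocks = 0;
	}
}
```

With a 3-block file, six calls to model_heat() bring the temperature to 2. If those six blocks were flushed over a 6000 ns window, 3000 ns of idleness then removes one degree and 3 updated blocks.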
diff --git a/include/linux/data_temperature.h b/include/linux/data_temperature.h
new file mode 100644
index 000000000000..40abf6322385
--- /dev/null
+++ b/include/linux/data_temperature.h
@@ -0,0 +1,124 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Data "temperature" paradigm declarations
+ *
+ * Copyright (c) 2024-2025 Viacheslav Dubeyko <slava@dubeyko.com>
+ */
+
+#ifndef _LINUX_DATA_TEMPERATURE_H
+#define _LINUX_DATA_TEMPERATURE_H
+
+/*
+ * struct data_temperature - data temperature definition
+ * @temperature: current temperature of a file
+ * @change_lock: modification lock
+ * @state: current state of data temperature object
+ * @dirty_blocks: current number of dirty blocks in page cache
+ * @updated_blocks: number of updated blocks [start_timestamp, end_timestamp]
+ * @start_timestamp: starting timestamp of update operations
+ * @end_timestamp: finishing timestamp of update operations
+ */
+struct data_temperature {
+	atomic_t temperature;
+
+	spinlock_t change_lock;
+	int state;
+	u64 dirty_blocks;
+	u64 updated_blocks;
+	u64 start_timestamp;
+	u64 end_timestamp;
+};
+
+enum data_temperature_state {
+	DATA_TEMPERATURE_UNKNOWN_STATE,
+	DATA_TEMPERATURE_CREATED,
+	DATA_TEMPERATURE_UPDATE_STARTED,
+	DATA_TEMPERATURE_UPDATE_FINISHED,
+	DATA_TEMPERATURE_STATE_MAX
+};
+
+#ifdef CONFIG_DATA_TEMPERATURE
+
+int __set_data_temperature_info(struct inode *inode);
+void __remove_data_temperature_info(struct inode *inode);
+int __get_data_temperature(const struct inode *inode);
+int __increase_data_temperature_by_dirty_folio(struct folio *folio);
+int __account_flushed_folio_by_data_temperature(struct folio *folio);
+
+static inline
+struct data_temperature *get_data_temperature_info(const struct inode *inode)
+{
+	return smp_load_acquire(&inode->i_data_temperature_info);
+}
+
+static inline
+int set_data_temperature_info(struct inode *inode)
+{
+	return __set_data_temperature_info(inode);
+}
+
+static inline
+void remove_data_temperature_info(struct inode *inode)
+{
+	__remove_data_temperature_info(inode);
+}
+
+static inline
+int get_data_temperature(const struct inode *inode)
+{
+	return __get_data_temperature(inode);
+}
+
+static inline
+int increase_data_temperature_by_dirty_folio(struct folio *folio)
+{
+	return __increase_data_temperature_by_dirty_folio(folio);
+}
+
+static inline
+int account_flushed_folio_by_data_temperature(struct folio *folio)
+{
+	return __account_flushed_folio_by_data_temperature(folio);
+}
+
+#else  /* !CONFIG_DATA_TEMPERATURE */
+
+static inline
+int set_data_temperature_info(struct inode *inode)
+{
+	return 0;
+}
+
+static inline
+void remove_data_temperature_info(struct inode *inode)
+{
+	return;
+}
+
+static inline
+struct data_temperature *get_data_temperature_info(const struct inode *inode)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline
+int get_data_temperature(const struct inode *inode)
+{
+	return 0;
+}
+
+static inline
+int increase_data_temperature_by_dirty_folio(struct folio *folio)
+{
+	return 0;
+}
+
+static inline
+int account_flushed_folio_by_data_temperature(struct folio *folio)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_DATA_TEMPERATURE */
+
+#endif	/* _LINUX_DATA_TEMPERATURE_H */
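The state values declared in this header drive a small per-file state machine in the two folio hooks. The following is a simplified user-space sketch of those transitions, not code from the patch: the model_* names are hypothetical, and the timestamps, temperature arithmetic and spinlock are omitted.

```c
#include <stdint.h>

/* Hypothetical mirror of the three states that the hooks act on. */
enum model_state {
	MODEL_CREATED,
	MODEL_UPDATE_STARTED,
	MODEL_UPDATE_FINISHED
};

struct model {
	enum model_state state;
	uint64_t dirty_blocks;
	uint64_t updated_blocks;
};

/* A folio was dirtied (the folio_account_dirtied() hook in the patch). */
static void model_on_dirty(struct model *m)
{
	switch (m->state) {
	case MODEL_CREATED:
	case MODEL_UPDATE_FINISHED:
		/*
		 * Open a new update window. In the FINISHED case the patch
		 * first applies cooling for the idle period.
		 */
		m->dirty_blocks = 1;
		m->state = MODEL_UPDATE_STARTED;
		break;
	case MODEL_UPDATE_STARTED:
		/* The patch also recomputes the temperature at this point. */
		m->dirty_blocks++;
		break;
	}
}

/* A folio is about to be written back (folio_clear_dirty_for_io() hook). */
static void model_on_flush(struct model *m)
{
	if (m->state != MODEL_UPDATE_STARTED)
		return;

	if (m->dirty_blocks > 0) {
		m->dirty_blocks--;
		m->updated_blocks++;
	}

	/* The window closes once every dirtied block has been flushed. */
	if (m->dirty_blocks == 0)
		m->state = MODEL_UPDATE_FINISHED;
}
```

Two dirty events followed by two flushes leave the model in MODEL_UPDATE_FINISHED with updated_blocks equal to 2. The next dirty event reopens the window and keeps the updated_blocks count.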
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a4af70367f8a..57c4810a28a0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -753,6 +753,10 @@  struct inode {
 	struct fsverity_info	*i_verity_info;
 #endif
 
+#ifdef CONFIG_DATA_TEMPERATURE
+	struct data_temperature		*i_data_temperature_info;
+#endif
+
 	void			*i_private; /* fs or device private pointer */
 } __randomize_layout;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d9861e42b2bd..5de458b7fefc 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -38,6 +38,7 @@ 
 #include <linux/sched/rt.h>
 #include <linux/sched/signal.h>
 #include <linux/mm_inline.h>
+#include <linux/data_temperature.h>
 #include <trace/events/writeback.h>
 
 #include "internal.h"
@@ -2775,6 +2776,10 @@  static void folio_account_dirtied(struct folio *folio,
 		__this_cpu_add(bdp_ratelimits, nr);
 
 		mem_cgroup_track_foreign_dirty(folio, wb);
+
+#ifdef CONFIG_DATA_TEMPERATURE
+		increase_data_temperature_by_dirty_folio(folio);
+#endif	/* CONFIG_DATA_TEMPERATURE */
 	}
 }
 
@@ -3006,6 +3011,10 @@  bool folio_clear_dirty_for_io(struct folio *folio)
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 
+#ifdef CONFIG_DATA_TEMPERATURE
+	account_flushed_folio_by_data_temperature(folio);
+#endif	/* CONFIG_DATA_TEMPERATURE */
+
 	if (mapping && mapping_can_writeback(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;