diff mbox series

[v2,1/2] hugetlbfs: support tracepoint

Message ID 20240704030704.2289667-2-lihongbo22@huawei.com (mailing list archive)
State New
Headers show
Series Introduce tracepoint for hugetlbfs | expand

Commit Message

Hongbo Li July 4, 2024, 3:07 a.m. UTC
Add basic tracepoints for {alloc, evict, free}_inode, setattr and
fallocate. These can help users to debug hugetlbfs more conveniently.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
---
 MAINTAINERS                      |   1 +
 include/trace/events/hugetlbfs.h | 160 +++++++++++++++++++++++++++++++
 2 files changed, 161 insertions(+)
 create mode 100644 include/trace/events/hugetlbfs.h

Comments

Matthew Wilcox (Oracle) July 4, 2024, 3:37 a.m. UTC | #1
On Thu, Jul 04, 2024 at 11:07:03AM +0800, Hongbo Li wrote:
> +	TP_printk("dev = (%d,%d), ino = %lu, dir = %lu, mode = 0%o",
> +		MAJOR(__entry->dev), MINOR(__entry->dev),
> +		(unsigned long) __entry->ino,
> +		(unsigned long) __entry->dir, __entry->mode)

erofs and f2fs are the only two places that print devices like this.

	"dev=%d:%d inode=%lx"

Why do we need dir and mode?

Actually, why do we need a tracepoint on alloc_inode at all?  What
does it help us debug, and why does no other filesystem need an
alloc_inode tracepoint?
Hongbo Li July 4, 2024, 6:40 a.m. UTC | #2
On 2024/7/4 11:37, Matthew Wilcox wrote:
> On Thu, Jul 04, 2024 at 11:07:03AM +0800, Hongbo Li wrote:
>> +	TP_printk("dev = (%d,%d), ino = %lu, dir = %lu, mode = 0%o",
>> +		MAJOR(__entry->dev), MINOR(__entry->dev),
>> +		(unsigned long) __entry->ino,
>> +		(unsigned long) __entry->dir, __entry->mode)
> 
> erofs and f2fs are the only two places that print devices like this.
> 
> 	"dev=%d:%d inode=%lx"
> 
> Why do we need dir and mode?
Thanks for reviewing!

Here dir and mode are used to track the creation of the directory tree.
> 
> Actually, why do we need a tracepoint on alloc_inode at all?  What
> does it help us debug, and why does no other filesystem need an
> alloc_inode tracepoint?
> 
In fact, f2fs and ext4 have added this tracepoint such as 
trace_f2fs_new_inode(in f2fs) and trace_ext4_allocate_inode(in ext4). 
This can trace the lifecycle of an inode comprehensively. These 
tracepoints are used to debug some closed application scenarios, and 
also helping developers to debug the filesystem logic in hugetlbfs.

Thanks,
Hongbo
Dave Chinner July 4, 2024, 12:56 p.m. UTC | #3
On Thu, Jul 04, 2024 at 04:37:30AM +0100, Matthew Wilcox wrote:
> On Thu, Jul 04, 2024 at 11:07:03AM +0800, Hongbo Li wrote:
> > +	TP_printk("dev = (%d,%d), ino = %lu, dir = %lu, mode = 0%o",
> > +		MAJOR(__entry->dev), MINOR(__entry->dev),
> > +		(unsigned long) __entry->ino,
> > +		(unsigned long) __entry->dir, __entry->mode)
> 
> erofs and f2fs are the only two places that print devices like this.
> 
> 	"dev=%d:%d inode=%lx"

"dev %d:%d inode %lx"

i.e. every token in the output should be space separated, and no
commas between values.

Any other format makes it difficult for post processing tracepoint
output with sed, grep, awk, python, etc. Every token then has to be
split into name and value parts, and then the value has to have the
comma stripped from it.

Having to do this is additional work when writing use-once scripts
that get thrown away when the tracepoint output analysis is done
is painful, and it's completely unnecessary if the tracepoint output
is completely space separated from the start.

-Dave.
Steven Rostedt July 4, 2024, 2:13 p.m. UTC | #4
On Thu, 4 Jul 2024 22:56:29 +1000
Dave Chinner <david@fromorbit.com> wrote:

> Having to do this is additional work when writing use-once scripts
> that get thrown away when the tracepoint output analysis is done
> is painful, and it's completely unnecessary if the tracepoint output
> is completely space separated from the start.

If you are using scripts to parse the output, then you could just
enable the "fields" options, which will just ignore the TP_printk() and
print the fields in both their hex and decimal values:

 # trace-cmd start -e filemap -O fields

// the above fields change can also be done with:
//  echo 1 > /sys/kernel/tracing/options/fields

 # trace-cmd show
# tracer: nop
#
# entries-in-buffer/entries-written: 8/8   #P:8
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
            less-2527    [004] ..... 61949.896458: mm_filemap_add_to_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)
            less-2527    [004] d..2. 61949.896926: mm_filemap_delete_from_page_cache: pfn=0x152b07 (1387271) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)
     jbd2/vda3-8-268     [005] ..... 61954.461964: mm_filemap_add_to_page_cache: pfn=0x152b70 (1387376) i_ino=0xfe00003 (266338307) index=0x30bd33 (3194163) s_dev=0x3 (3) order=(0)
     jbd2/vda3-8-268     [005] ..... 61954.462669: mm_filemap_add_to_page_cache: pfn=0x15335b (1389403) i_ino=0xfe00003 (266338307) index=0x30bd40 (3194176) s_dev=0x3 (3) order=(0)
     jbd2/vda3-8-268     [005] ..... 62001.565391: mm_filemap_add_to_page_cache: pfn=0x13a996 (1288598) i_ino=0xfe00003 (266338307) index=0x30bd41 (3194177) s_dev=0x3 (3) order=(0)
     jbd2/vda3-8-268     [005] ..... 62001.566081: mm_filemap_add_to_page_cache: pfn=0x1446b5 (1328821) i_ino=0xfe00003 (266338307) index=0x30bd43 (3194179) s_dev=0x3 (3) order=(0)
            less-2530    [004] ..... 62033.182309: mm_filemap_add_to_page_cache: pfn=0x13d755 (1300309) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)
            less-2530    [004] d..2. 62033.182801: mm_filemap_delete_from_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)


Just an FYI,

-- Steve
Dave Chinner July 4, 2024, 11:34 p.m. UTC | #5
On Thu, Jul 04, 2024 at 10:13:22AM -0400, Steven Rostedt wrote:
> On Thu, 4 Jul 2024 22:56:29 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > Having to do this is additional work when writing use-once scripts
> > that get thrown away when the tracepoint output analysis is done
> > is painful, and it's completely unnecessary if the tracepoint output
> > is completely space separated from the start.
> 
> If you are using scripts to parse the output, then you could just
> enable the "fields" options, which will just ignore the TP_printk() and
> print the fields in both their hex and decimal values:
> 
>  # trace-cmd start -e filemap -O fields
> 
> // the above fields change can also be done with:
> //  echo 1 > /sys/kernel/tracing/options/fields
> 
>  # trace-cmd show
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 8/8   #P:8
> #
> #                                _-----=> irqs-off/BH-disabled
> #                               / _----=> need-resched
> #                              | / _---=> hardirq/softirq
> #                              || / _--=> preempt-depth
> #                              ||| / _-=> migrate-disable
> #                              |||| /     delay
> #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
> #              | |         |   |||||     |         |
>             less-2527    [004] ..... 61949.896458: mm_filemap_add_to_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)
>             less-2527    [004] d..2. 61949.896926: mm_filemap_delete_from_page_cache: pfn=0x152b07 (1387271) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)
>      jbd2/vda3-8-268     [005] ..... 61954.461964: mm_filemap_add_to_page_cache: pfn=0x152b70 (1387376) i_ino=0xfe00003 (266338307) index=0x30bd33 (3194163) s_dev=0x3 (3) order=(0)
>      jbd2/vda3-8-268     [005] ..... 61954.462669: mm_filemap_add_to_page_cache: pfn=0x15335b (1389403) i_ino=0xfe00003 (266338307) index=0x30bd40 (3194176) s_dev=0x3 (3) order=(0)
>      jbd2/vda3-8-268     [005] ..... 62001.565391: mm_filemap_add_to_page_cache: pfn=0x13a996 (1288598) i_ino=0xfe00003 (266338307) index=0x30bd41 (3194177) s_dev=0x3 (3) order=(0)
>      jbd2/vda3-8-268     [005] ..... 62001.566081: mm_filemap_add_to_page_cache: pfn=0x1446b5 (1328821) i_ino=0xfe00003 (266338307) index=0x30bd43 (3194179) s_dev=0x3 (3) order=(0)
>             less-2530    [004] ..... 62033.182309: mm_filemap_add_to_page_cache: pfn=0x13d755 (1300309) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)
>             less-2530    [004] d..2. 62033.182801: mm_filemap_delete_from_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)

Yes, I know about that. But this just makes things harder, because
now there are *3* different formats that have to be handled (i.e.
now we also have to strip "()" around numbers).

-Dave.
Hongbo Li July 5, 2024, 6:24 a.m. UTC | #6
On 2024/7/5 7:34, Dave Chinner wrote:
> On Thu, Jul 04, 2024 at 10:13:22AM -0400, Steven Rostedt wrote:
>> On Thu, 4 Jul 2024 22:56:29 +1000
>> Dave Chinner <david@fromorbit.com> wrote:
>>
>>> Having to do this is additional work when writing use-once scripts
>>> that get thrown away when the tracepoint output analysis is done
>>> is painful, and it's completely unnecessary if the tracepoint output
>>> is completely space separated from the start.
>>
>> If you are using scripts to parse the output, then you could just
>> enable the "fields" options, which will just ignore the TP_printk() and
>> print the fields in both their hex and decimal values:
>>
>>   # trace-cmd start -e filemap -O fields
>>
>> // the above fields change can also be done with:
>> //  echo 1 > /sys/kernel/tracing/options/fields
>>
>>   # trace-cmd show
>> # tracer: nop
>> #
>> # entries-in-buffer/entries-written: 8/8   #P:8
>> #
>> #                                _-----=> irqs-off/BH-disabled
>> #                               / _----=> need-resched
>> #                              | / _---=> hardirq/softirq
>> #                              || / _--=> preempt-depth
>> #                              ||| / _-=> migrate-disable
>> #                              |||| /     delay
>> #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
>> #              | |         |   |||||     |         |
>>              less-2527    [004] ..... 61949.896458: mm_filemap_add_to_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)
>>              less-2527    [004] d..2. 61949.896926: mm_filemap_delete_from_page_cache: pfn=0x152b07 (1387271) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)
>>       jbd2/vda3-8-268     [005] ..... 61954.461964: mm_filemap_add_to_page_cache: pfn=0x152b70 (1387376) i_ino=0xfe00003 (266338307) index=0x30bd33 (3194163) s_dev=0x3 (3) order=(0)
>>       jbd2/vda3-8-268     [005] ..... 61954.462669: mm_filemap_add_to_page_cache: pfn=0x15335b (1389403) i_ino=0xfe00003 (266338307) index=0x30bd40 (3194176) s_dev=0x3 (3) order=(0)
>>       jbd2/vda3-8-268     [005] ..... 62001.565391: mm_filemap_add_to_page_cache: pfn=0x13a996 (1288598) i_ino=0xfe00003 (266338307) index=0x30bd41 (3194177) s_dev=0x3 (3) order=(0)
>>       jbd2/vda3-8-268     [005] ..... 62001.566081: mm_filemap_add_to_page_cache: pfn=0x1446b5 (1328821) i_ino=0xfe00003 (266338307) index=0x30bd43 (3194179) s_dev=0x3 (3) order=(0)
>>              less-2530    [004] ..... 62033.182309: mm_filemap_add_to_page_cache: pfn=0x13d755 (1300309) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)
>>              less-2530    [004] d..2. 62033.182801: mm_filemap_delete_from_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0)
> 
> Yes, I know about that. But this just makes things harder, because
> now there are *3* different formats that have to be handled (i.e.
> now we also have to strip "()" around numbers).
Perhaps if users want to filter the format, they could enable the 
"fields" option in a unified manner. As for TP_printk(), it depends on 
how to better display the data used for debugging.

Thanks,
Hongbo
> 
> -Dave.
diff mbox series

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index cd2ca0c3158e..865c48e92d40 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10188,6 +10188,7 @@  F:	Documentation/mm/hugetlbfs_reserv.rst
 F:	Documentation/mm/vmemmap_dedup.rst
 F:	fs/hugetlbfs/
 F:	include/linux/hugetlb.h
+F:	include/trace/events/hugetlbfs.h
 F:	mm/hugetlb.c
 F:	mm/hugetlb_vmemmap.c
 F:	mm/hugetlb_vmemmap.h
diff --git a/include/trace/events/hugetlbfs.h b/include/trace/events/hugetlbfs.h
new file mode 100644
index 000000000000..975f584f6f51
--- /dev/null
+++ b/include/trace/events/hugetlbfs.h
@@ -0,0 +1,160 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hugetlbfs
+
+#if !defined(_TRACE_HUGETLBFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HUGETLBFS_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(hugetlbfs_alloc_inode,
+
+	TP_PROTO(struct inode *inode, struct inode *dir, int mode),
+
+	TP_ARGS(inode, dir, mode),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		dev)
+		__field(ino_t,		ino)
+		__field(ino_t,		dir)
+		__field(__u16,		mode)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= inode->i_sb->s_dev;
+		__entry->ino		= inode->i_ino;
+		__entry->dir		= dir->i_ino;
+		__entry->mode		= mode;
+	),
+
+	TP_printk("dev = (%d,%d), ino = %lu, dir = %lu, mode = 0%o",
+		MAJOR(__entry->dev), MINOR(__entry->dev),
+		(unsigned long) __entry->ino,
+		(unsigned long) __entry->dir, __entry->mode)
+);
+
+DECLARE_EVENT_CLASS(hugetlbfs__inode,
+
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		dev)
+		__field(ino_t,		ino)
+		__field(__u16,		mode)
+		__field(loff_t,		size)
+		__field(unsigned int,	nlink)
+		__field(unsigned int,	seals)
+		__field(blkcnt_t,	blocks)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= inode->i_sb->s_dev;
+		__entry->ino		= inode->i_ino;
+		__entry->mode		= inode->i_mode;
+		__entry->size		= inode->i_size;
+		__entry->nlink		= inode->i_nlink;
+		__entry->seals		= HUGETLBFS_I(inode)->seals;
+		__entry->blocks		= inode->i_blocks;
+	),
+
+	TP_printk("dev = (%d,%d), ino = %lu, i_mode = 0%o, i_size = %lld, i_nlink = %u, seals = %u, i_blocks = %llu",
+		MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino,
+		__entry->mode, __entry->size, __entry->nlink, __entry->seals,
+		(unsigned long long)__entry->blocks)
+);
+
+DEFINE_EVENT(hugetlbfs__inode, hugetlbfs_evict_inode,
+
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode)
+);
+
+DEFINE_EVENT(hugetlbfs__inode, hugetlbfs_free_inode,
+
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode)
+);
+
+TRACE_EVENT(hugetlbfs_setattr,
+
+	TP_PROTO(struct inode *inode, struct dentry *dentry,
+		struct iattr *attr),
+
+	TP_ARGS(inode, dentry, attr),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		dev)
+		__field(ino_t,		ino)
+		__field(unsigned int,	d_len)
+		__string(d_name,	dentry->d_name.name)
+		__field(unsigned int,	ia_valid)
+		__field(unsigned int,	ia_mode)
+		__field(unsigned int,	ia_uid)
+		__field(unsigned int,	ia_gid)
+		__field(loff_t,		old_size)
+		__field(loff_t,		ia_size)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= inode->i_sb->s_dev;
+		__entry->ino		= inode->i_ino;
+		__entry->d_len		= dentry->d_name.len;
+		__assign_str(d_name);
+		__entry->ia_valid	= attr->ia_valid;
+		__entry->ia_mode	= attr->ia_mode;
+		__entry->ia_uid		= from_kuid_munged(current_user_ns(), attr->ia_uid);
+		__entry->ia_gid		= from_kgid_munged(current_user_ns(), attr->ia_gid);
+		__entry->old_size	= inode->i_size;
+		__entry->ia_size	= attr->ia_size;
+	),
+
+	TP_printk("dev = (%d,%d), ino = %lu, name = %.*s, ia_valid = %u, ia_mode = 0%o, ia_uid = %u, ia_gid = %u, old_size = %lld, ia_size = %lld",
+		MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long)__entry->ino,
+		__entry->d_len, __get_str(d_name), __entry->ia_valid, __entry->ia_mode,
+		__entry->ia_uid, __entry->ia_gid, __entry->old_size, __entry->ia_size)
+);
+
+TRACE_EVENT(hugetlbfs_fallocate,
+
+	TP_PROTO(struct inode *inode, int mode,
+		loff_t offset, loff_t len, int ret),
+
+	TP_ARGS(inode, mode, offset, len, ret),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		dev)
+		__field(ino_t,		ino)
+		__field(int,		mode)
+		__field(loff_t,		offset)
+		__field(loff_t,		len)
+		__field(loff_t,		size)
+		__field(int,		ret)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= inode->i_sb->s_dev;
+		__entry->ino		= inode->i_ino;
+		__entry->mode		= mode;
+		__entry->offset		= offset;
+		__entry->len		= len;
+		__entry->size		= inode->i_size;
+		__entry->ret		= ret;
+	),
+
+	TP_printk("dev = (%d,%d), ino = %lu, mode = %x, offset = %lld, len = %lld,  i_size = %lld, ret = %d",
+		MAJOR(__entry->dev), MINOR(__entry->dev),
+		(unsigned long)__entry->ino, __entry->mode,
+		(unsigned long long)__entry->offset,
+		(unsigned long long)__entry->len,
+		(unsigned long long)__entry->size,
+		__entry->ret)
+);
+
+#endif /* _TRACE_HUGETLBFS_H */
+
+ /* This part must be outside protection */
+#include <trace/define_trace.h>