
[RFC,16/20] famfs: Add fault counters

Message ID 43245b463f00506016b8c39c0252faf62bd73e35.1708709155.git.john@groves.net
State Superseded
Series Introduce the famfs shared-memory file system

Commit Message

John Groves Feb. 23, 2024, 5:42 p.m. UTC
One of the key requirements for famfs is that it service vma faults
efficiently. Our metadata helps - the search order is n for n extents,
and n is usually 1. But we can still observe gnarly lock contention
in mm if PTE faults are happening. This commit introduces fault counters
that can be enabled and read via /sys/fs/famfs/...

These counters have proved useful in troubleshooting situations where
PTE faults were happening instead of PMD. No performance impact when
disabled.

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_file.c     | 97 +++++++++++++++++++++++++++++++++++++++
 fs/famfs/famfs_internal.h | 73 +++++++++++++++++++++++++++++
 2 files changed, 170 insertions(+)
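
For orientation, here is a minimal user-space sketch of how the counters added
by this patch would be exercised. The sysfs paths follow the commit message and
the __ATTR() names in the patch below; error handling is abbreviated and the
workload is a placeholder.

#include <stdio.h>
#include <stdlib.h>

static void write_sysfs(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

static unsigned long long read_sysfs_u64(const char *path)
{
        unsigned long long v = 0;
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                exit(1);
        }
        if (fscanf(f, "%llu", &v) != 1)
                v = 0;
        fclose(f);
        return v;
}

int main(void)
{
        /* Enabling (any value > 0) also clears the counters; see _store() below */
        write_sysfs("/sys/fs/famfs/fault_count_enable", "1");

        /* ... run the mmap/fault workload under test here ... */

        printf("pte faults: %llu\n", read_sysfs_u64("/sys/fs/famfs/pte_fault_ct"));
        printf("pmd faults: %llu\n", read_sysfs_u64("/sys/fs/famfs/pmd_fault_ct"));
        return 0;
}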

Comments

Dave Hansen Feb. 23, 2024, 6:23 p.m. UTC | #1
On 2/23/24 09:42, John Groves wrote:
> One of the key requirements for famfs is that it service vma faults
> efficiently. Our metadata helps - the search order is n for n extents,
> and n is usually 1. But we can still observe gnarly lock contention
> in mm if PTE faults are happening. This commit introduces fault counters
> that can be enabled and read via /sys/fs/famfs/...
> 
> These counters have proved useful in troubleshooting situations where
> PTE faults were happening instead of PMD. No performance impact when
> disabled.

This seems kinda wonky.  Why does _this_ specific filesystem need its
own fault counters?  Seems like something we'd want to do much more
generically, if it is needed at all.

Was the issue here just that vm_ops->fault() was getting called instead
of ->huge_fault()?  Or something more subtle?
John Groves Feb. 23, 2024, 7:56 p.m. UTC | #2
On 24/02/23 10:23AM, Dave Hansen wrote:
> On 2/23/24 09:42, John Groves wrote:
> > One of the key requirements for famfs is that it service vma faults
> > efficiently. Our metadata helps - the search order is n for n extents,
> > and n is usually 1. But we can still observe gnarly lock contention
> > in mm if PTE faults are happening. This commit introduces fault counters
> > that can be enabled and read via /sys/fs/famfs/...
> > 
> > These counters have proved useful in troubleshooting situations where
> > PTE faults were happening instead of PMD. No performance impact when
> > disabled.
> 
> This seems kinda wonky.  Why does _this_ specific filesystem need its
> own fault counters?  Seems like something we'd want to do much more
> generically, if it is needed at all.
> 
> Was the issue here just that vm_ops->fault() was getting called instead
> of ->huge_fault()?  Or something more subtle?

Thanks for your reply Dave!

First, I'm willing to pull the fault counters out if the brain trust doesn't
like them.

I put them in because we were running benchmarks of computational data
analytics and noted that jobs took 3x as long on famfs as raw dax -
which indicated I was doing something wrong, because it should be equivalent
or very close.

The solution was to call thp_get_unmapped_area() in
famfs_file_operations, and performance doesn't vary significantly from raw
dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.
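
A rough sketch of the fix being described: wire thp_get_unmapped_area() into the
file_operations so mmap() hands back PMD-aligned addresses. Only
.get_unmapped_area is the point here; the other methods and the famfs_file_mmap
name are illustrative, not the actual famfs definitions.

static const struct file_operations famfs_file_operations = {
        .owner             = THIS_MODULE,
        .mmap              = famfs_file_mmap,        /* illustrative name */
        .get_unmapped_area = thp_get_unmapped_area,  /* PMD-aligned mmap hints */
        .fsync             = noop_fsync,
};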

After that I wanted a way to be double-secret-certain that it was servicing
PMD faults as intended. Which it basically always is, so far. (The smoke
tests in user space check this.)

John
Dan Williams Feb. 23, 2024, 8:04 p.m. UTC | #3
John Groves wrote:
> On 24/02/23 10:23AM, Dave Hansen wrote:
> > On 2/23/24 09:42, John Groves wrote:
> > > One of the key requirements for famfs is that it service vma faults
> > > efficiently. Our metadata helps - the search order is n for n extents,
> > > and n is usually 1. But we can still observe gnarly lock contention
> > > in mm if PTE faults are happening. This commit introduces fault counters
> > > that can be enabled and read via /sys/fs/famfs/...
> > > 
> > > These counters have proved useful in troubleshooting situations where
> > > PTE faults were happening instead of PMD. No performance impact when
> > > disabled.
> > 
> > This seems kinda wonky.  Why does _this_ specific filesystem need its
> > own fault counters?  Seems like something we'd want to do much more
> > generically, if it is needed at all.
> > 
> > Was the issue here just that vm_ops->fault() was getting called instead
> > of ->huge_fault()?  Or something more subtle?
> 
> Thanks for your reply Dave!
> 
> First, I'm willing to pull the fault counters out if the brain trust doesn't
> like them.
> 
> I put them in because we were running benchmarks of computational data
> analytics and noted that jobs took 3x as long on famfs as raw dax -
> which indicated I was doing something wrong, because it should be equivalent
> or very close.
> 
> The solution was to call thp_get_unmapped_area() in
> famfs_file_operations, and performance doesn't vary significantly from raw
> dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.
> 
> After that I wanted a way to be double-secret-certain that it was servicing
> PMD faults as intended. Which it basically always is, so far. (The smoke
> tests in user space check this.)

We had similar unit test regression concerns with fsdax where some
upstream change silently broke PMD faults. The solution there was trace
points in the fault handlers and a basic test that knows a priori that it
*should* be triggering a certain number of huge faults:

https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
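
As a rough illustration of that approach, a tracepoint in the huge-fault path
that a regression test can count might look like the sketch below. The event
name is hypothetical and the usual trace-header boilerplate (#define
TRACE_SYSTEM, include guards, #include <linux/tracepoint.h>) is omitted; fsdax
itself relies on its own dax_*_fault events for this.

TRACE_EVENT(famfs_huge_fault,
        TP_PROTO(unsigned int order),
        TP_ARGS(order),
        TP_STRUCT__entry(
                __field(unsigned int, order)
        ),
        TP_fast_assign(
                __entry->order = order;
        ),
        TP_printk("order=%u", __entry->order)
);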
John Groves Feb. 23, 2024, 8:39 p.m. UTC | #4
On 24/02/23 12:04PM, Dan Williams wrote:
> John Groves wrote:
> > On 24/02/23 10:23AM, Dave Hansen wrote:
> > > On 2/23/24 09:42, John Groves wrote:
> > > > One of the key requirements for famfs is that it service vma faults
> > > > efficiently. Our metadata helps - the search order is n for n extents,
> > > > and n is usually 1. But we can still observe gnarly lock contention
> > > > in mm if PTE faults are happening. This commit introduces fault counters
> > > > that can be enabled and read via /sys/fs/famfs/...
> > > > 
> > > > These counters have proved useful in troubleshooting situations where
> > > > PTE faults were happening instead of PMD. No performance impact when
> > > > disabled.
> > > 
> > > This seems kinda wonky.  Why does _this_ specific filesystem need its
> > > own fault counters?  Seems like something we'd want to do much more
> > > generically, if it is needed at all.
> > > 
> > > Was the issue here just that vm_ops->fault() was getting called instead
> > > of ->huge_fault()?  Or something more subtle?
> > 
> > Thanks for your reply Dave!
> > 
> > First, I'm willing to pull the fault counters out if the brain trust doesn't
> > like them.
> > 
> > I put them in because we were running benchmarks of computational data
> > analytics and noted that jobs took 3x as long on famfs as raw dax -
> > which indicated I was doing something wrong, because it should be equivalent
> > or very close.
> > 
> > The solution was to call thp_get_unmapped_area() in
> > famfs_file_operations, and performance doesn't vary significantly from raw
> > dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.
> > 
> > After that I wanted a way to be double-secret-certain that it was servicing
> > PMD faults as intended. Which it basically always is, so far. (The smoke
> > tests in user space check this.)
> 
> We had similar unit test regression concerns with fsdax where some
> upstream change silently broke PMD faults. The solution there was trace
> points in the fault handlers and a basic test that knows a priori that it
> *should* be triggering a certain number of huge faults:
> 
> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31

Good approach, thanks Dan! My working assumption is that we'll be able to make
that approach work in the famfs tests. So the fault counters should go away
in the next version.

John
Dave Hansen Feb. 23, 2024, 9:19 p.m. UTC | #5
On 2/23/24 12:39, John Groves wrote:
>> We had similar unit test regression concerns with fsdax where some
>> upstream change silently broke PMD faults. The solution there was trace
>> points in the fault handlers and a basic test that knows a priori that it
>> *should* be triggering a certain number of huge faults:
>>
>> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
> Good approach, thanks Dan! My working assumption is that we'll be able to make
> that approach work in the famfs tests. So the fault counters should go away
> in the next version.

I do really suspect there's something more generic that should be done
here.  Maybe we need a generic 'huge_faults' perf event to pair up with
the good ol' faults that we already have:

# perf stat -e faults /bin/ls

 Performance counter stats for '/bin/ls':

               104      faults


       0.001499862 seconds time elapsed

       0.001490000 seconds user
       0.000000000 seconds sys
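
A minimal sketch of that idea, assuming a hypothetical PERF_COUNT_SW_HUGE_FAULTS
event id and a hypothetical VM_FAULT_HUGE result bit (the real
mm_account_fault() arguments are also simplified here):

static void mm_account_fault(struct pt_regs *regs, unsigned long address,
                             unsigned int flags, vm_fault_t ret)
{
        /* existing software event */
        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

        /* hypothetical: the handlers would have to report that a huge
         * entry was installed */
        if (ret & VM_FAULT_HUGE)
                perf_sw_event(PERF_COUNT_SW_HUGE_FAULTS, 1, regs, address);
}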
Dan Williams Feb. 23, 2024, 11:50 p.m. UTC | #6
Dave Hansen wrote:
> On 2/23/24 12:39, John Groves wrote:
> >> We had similar unit test regression concerns with fsdax where some
> >> upstream change silently broke PMD faults. The solution there was trace
> >> points in the fault handlers and a basic test that knows a priori that it
> >> *should* be triggering a certain number of huge faults:
> >>
> >> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
> > Good approach, thanks Dan! My working assumption is that we'll be able to make
> > that approach work in the famfs tests. So the fault counters should go away
> > in the next version.
> 
> I do really suspect there's something more generic that should be done
> here.  Maybe we need a generic 'huge_faults' perf event to pair up with
> the good ol' faults that we already have:
> 
> # perf stat -e faults /bin/ls
> 
>  Performance counter stats for '/bin/ls':
> 
>                104      faults
> 
> 
>        0.001499862 seconds time elapsed
> 
>        0.001490000 seconds user
>        0.000000000 seconds sys

Certainly something like that would have satisfied this sanity test use
case. I will note that mm_account_fault() would need some help to figure
out the size of the page table entry that got installed. Maybe
extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
from PUD to PMD, or all the way back to PTE.

Then use cases like this could just add a dynamic probe in
mm_account_fault(). No real need for a new tracepoint unless there was a
use case for this outside of regression testing fault handlers, right?
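
A sketch of what that could look like from a fault handler's perspective. The
flag values are hypothetical (they would have to avoid today's vm_fault_reason
bits) and handle_the_fault() stands in for the real dax/iomap call:

/* hypothetical vm_fault_reason extensions */
#define VM_FAULT_PMD    ((__force vm_fault_t)0x800000)
#define VM_FAULT_PUD    ((__force vm_fault_t)0x1000000)

static vm_fault_t famfs_huge_fault(struct vm_fault *vmf, unsigned int order)
{
        vm_fault_t ret = handle_the_fault(vmf, order);  /* placeholder */

        /* report the installed size so mm_account_fault() (or a probe
         * attached there) can tell PMD/PUD from a PTE fallback */
        if (!(ret & VM_FAULT_FALLBACK)) {
                if (order == PMD_ORDER)
                        ret |= VM_FAULT_PMD;
                else if (order == PUD_ORDER)
                        ret |= VM_FAULT_PUD;
        }
        return ret;
}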
Matthew Wilcox (Oracle) Feb. 24, 2024, 3:59 a.m. UTC | #7
On Fri, Feb 23, 2024 at 03:50:33PM -0800, Dan Williams wrote:
> Certainly something like that would have satisfied this sanity test use
> case. I will note that mm_account_fault() would need some help to figure
> out the size of the page table entry that got installed. Maybe
> extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
> VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
> from PUD to PMD, or all the way back to PTE.

Ugh, no, it's more complicated than that.  Look at the recent changes to
set_ptes().  We can now install PTEs of many different sizes, depending
on the architecture.  Someday I look forward to supporting all the page
sizes on parisc (4k, 16k, 64k, 256k, ... 4G).
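
For context, a usage-side sketch of the set_ptes() point: a single call installs
'nr' contiguous PTEs (here a 16-page, 64KiB folio), so "what size got installed"
is per-call and architecture-dependent rather than just PTE vs PMD vs PUD. The
surrounding function and its arguments are illustrative only.

static void sketch_map_folio(struct vm_area_struct *vma, unsigned long addr,
                             pte_t *ptep, struct folio *folio, pgprot_t prot)
{
        unsigned int nr = folio_nr_pages(folio);        /* e.g. 16 */
        pte_t pte = mk_pte(folio_page(folio, 0), prot);

        set_ptes(vma->vm_mm, addr, ptep, pte, nr);
}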
Dan Williams Feb. 24, 2024, 4:30 a.m. UTC | #8
Matthew Wilcox wrote:
> On Fri, Feb 23, 2024 at 03:50:33PM -0800, Dan Williams wrote:
> > Certainly something like that would have satisfied this sanity test use
> > case. I will note that mm_account_fault() would need some help to figure
> > out the size of the page table entry that got installed. Maybe
> > extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
> > VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
> > from PUD to PMD, or all the way back to PTE.
> 
> Ugh, no, it's more complicated than that.  Look at the recent changes to
> set_ptes().  We can now install PTEs of many different sizes, depending
> on the architecture.  Someday I look forward to supporting all the page
> sizes on parisc (4k, 16k, 64k, 256k, ... 4G).

Nice!

There are enough bits in vm_fault_t to represent many page sizes instead
of the entry type as I suggested, but I would defer to you or Dave on
how to make "installed pte size" generically traceable per Dave's
suggestion.

Patch

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
index fd42d5966982..a626f8a89790 100644
--- a/fs/famfs/famfs_file.c
+++ b/fs/famfs/famfs_file.c
@@ -19,6 +19,100 @@ 
 #include <uapi/linux/famfs_ioctl.h>
 #include "famfs_internal.h"
 
+/***************************************************************************************
+ * filemap_fault counters
+ *
+ * The counters and the fault_count_enable file live at
+ * /sys/fs/famfs/
+ */
+struct famfs_fault_counters ffc;
+static int fault_count_enable;
+
+static ssize_t
+fault_count_enable_show(struct kobject *kobj,
+			struct kobj_attribute *attr,
+			char *buf)
+{
+	return sprintf(buf, "%d\n", fault_count_enable);
+}
+
+static ssize_t
+fault_count_enable_store(struct kobject        *kobj,
+			 struct kobj_attribute *attr,
+			 const char            *buf,
+			 size_t                 count)
+{
+	int value;
+	int rc;
+
+	rc = sscanf(buf, "%d", &value);
+	if (rc != 1)
+		return 0;
+
+	if (value > 0) /* clear fault counters when enabling, but not when disabling */
+		famfs_clear_fault_counters(&ffc);
+
+	fault_count_enable = value;
+	return count;
+}
+
+/* Individual fault counters are read-only */
+static ssize_t
+fault_count_pte_show(struct kobject *kobj,
+		     struct kobj_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "%llu", famfs_pte_fault_ct(&ffc));
+}
+
+static ssize_t
+fault_count_pmd_show(struct kobject *kobj,
+		     struct kobj_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "%llu", famfs_pmd_fault_ct(&ffc));
+}
+
+static ssize_t
+fault_count_pud_show(struct kobject *kobj,
+		     struct kobj_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "%llu", famfs_pud_fault_ct(&ffc));
+}
+
+static struct kobj_attribute fault_count_enable_attribute = __ATTR(fault_count_enable,
+								   0660,
+								   fault_count_enable_show,
+								   fault_count_enable_store);
+static struct kobj_attribute fault_count_pte_attribute = __ATTR(pte_fault_ct,
+								0440,
+								fault_count_pte_show,
+								NULL);
+static struct kobj_attribute fault_count_pmd_attribute = __ATTR(pmd_fault_ct,
+								0440,
+								fault_count_pmd_show,
+								NULL);
+static struct kobj_attribute fault_count_pud_attribute = __ATTR(pud_fault_ct,
+								0440,
+								fault_count_pud_show,
+								NULL);
+
+
+static struct attribute *attrs[] = {
+	&fault_count_enable_attribute.attr,
+	&fault_count_pte_attribute.attr,
+	&fault_count_pmd_attribute.attr,
+	&fault_count_pud_attribute.attr,
+	NULL,
+};
+
+struct attribute_group famfs_attr_group = {
+	.attrs = attrs,
+};
+
+/* End fault counters */
+
 /**
  * famfs_map_meta_alloc() - Allocate famfs file metadata
  * @mapp:       Pointer to an mcache_map_meta pointer
@@ -525,6 +619,9 @@  __famfs_filemap_fault(
 	if (IS_DAX(inode)) {
 		pfn_t pfn;
 
+		if (fault_count_enable)
+			famfs_inc_fault_counter_by_order(&ffc, pe_size);
+
 		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
 		if (ret & VM_FAULT_NEEDDSYNC)
 			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
index af3990d43305..987cb172a149 100644
--- a/fs/famfs/famfs_internal.h
+++ b/fs/famfs/famfs_internal.h
@@ -50,4 +50,77 @@  struct famfs_fs_info {
 	char                    *rootdev;
 };
 
+/*
+ * filemap_fault counters
+ */
+extern struct attribute_group famfs_attr_group;
+
+enum famfs_fault {
+	FAMFS_PTE = 0,
+	FAMFS_PMD,
+	FAMFS_PUD,
+	FAMFS_NUM_FAULT_TYPES,
+};
+
+static inline int valid_fault_type(int type)
+{
+	if (unlikely(type < 0 || type > FAMFS_PUD))
+		return 0;
+	return 1;
+}
+
+struct famfs_fault_counters {
+	atomic64_t fault_ct[FAMFS_NUM_FAULT_TYPES];
+};
+
+extern struct famfs_fault_counters ffc;
+
+static inline void famfs_clear_fault_counters(struct famfs_fault_counters *fc)
+{
+	int i;
+
+	for (i = 0; i < FAMFS_NUM_FAULT_TYPES; i++)
+		atomic64_set(&fc->fault_ct[i], 0);
+}
+
+static inline void famfs_inc_fault_counter(struct famfs_fault_counters *fc,
+					   enum famfs_fault type)
+{
+	if (valid_fault_type(type))
+		atomic64_inc(&fc->fault_ct[type]);
+}
+
+static inline void famfs_inc_fault_counter_by_order(struct famfs_fault_counters *fc, int order)
+{
+	int pgf = -1;
+
+	switch (order) {
+	case 0:
+		pgf = FAMFS_PTE;
+		break;
+	case PMD_ORDER:
+		pgf = FAMFS_PMD;
+		break;
+	case PUD_ORDER:
+		pgf = FAMFS_PUD;
+		break;
+	}
+	famfs_inc_fault_counter(fc, pgf);
+}
+
+static inline u64 famfs_pte_fault_ct(struct famfs_fault_counters *fc)
+{
+	return atomic64_read(&fc->fault_ct[FAMFS_PTE]);
+}
+
+static inline u64 famfs_pmd_fault_ct(struct famfs_fault_counters *fc)
+{
+	return atomic64_read(&fc->fault_ct[FAMFS_PMD]);
+}
+
+static inline u64 famfs_pud_fault_ct(struct famfs_fault_counters *fc)
+{
+	return atomic64_read(&fc->fault_ct[FAMFS_PUD]);
+}
+
 #endif /* FAMFS_INTERNAL_H */