diff mbox series

RFC fuse waitq latency

Message ID 6ba14287-336d-cdcd-0d39-680f288ca776@ddn.com (mailing list archive)
State New, archived
Headers show
Series RFC fuse waitq latency | expand

Commit Message

Bernd Schubert March 28, 2022, 1:21 p.m. UTC
I would like to discuss the user thread wake up latency in
fuse_dev_do_read(). Profiling fuse shows there is room for improvement
regarding memory copies and splice. The basic profiling with flame graphs
didn't reveal, though, why fuse is so much
slower (with an overlay file system) than just accessing the underlying
file system directly and also didn't reveal why a single threaded fuse
uses less than 100% cpu, with the application on top of use also using
less than 100% cpu (simple bonnie++ runs with 1B files).
So I started to suspect the wait queues and indeed, keeping the thread
that reads the fuse device for work running for some time gives quite
some improvements.





Without patch above

>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>       files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> imesrv1   30:1:1:10  5568  28  7784  40  9737  23  5756  29  5709  39  7573  25
> Latency             26813us     654us     965us     261us     550us     336ms



Patch above applied

                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
       files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
imesrv1   30:1:1:10  8043  30 12100  44 14024  22  7791  28  9238  43  9871  25
Latency               235us    3982us    3201us     240us     277us     355ms


So there is quite some improvement by 'just' preventing the thread going to
sleep, with the disadvantage that the thread now spins. This also does not
work that well when libfuse creates multiple threads (still with a single
threaded bonnie++), as wakeup then wakes up different threads, multiple
of them start to spin and without having profiled it, I guess
fiq->lock then might become a problem.

I had also tried to use swaitq instead of waitq, there is a little improvement
with it, but it does not solve the major issue.


Now if the wakeup is an issue, how did zufs avoid it? Looking at its code,
zufs also has a thread wakeup. Same for Miklos' fuse2. On our side
Dharmendra was just read to start to port Miklos' fuse2 to more recent
kernels and to add support into libfuse for it. Now with the waitq
latency we are not sure if this is actually the right approach.



Results above are done with bonnie++
bonnie++ -x 4 -q -s0  -d /scratch/dest/ -n 30:1:1:10 -r 0

using passthrough_hp. This is with additional patches (kernel side
has Dharmendras atomic-open optimizations, libfuse has additional
patches for atomic-open, a libfuse thread creation fix and more
fixes and options for passthrough_hp.cc).

passthrough_hp --foreground --nosplice --nocache --num_threads=1\
     /scratch/source /scratch/dest


Thanks,
Bernd

Comments

Miklos Szeredi April 22, 2022, 12:25 p.m. UTC | #1
On Mon, 28 Mar 2022 at 15:21, Bernd Schubert <bschubert@ddn.com> wrote:
>
> I would like to discuss the user thread wake up latency in
> fuse_dev_do_read(). Profiling fuse shows there is room for improvement
> regarding memory copies and splice. The basic profiling with flame graphs
> didn't reveal, though, why fuse is so much
> slower (with an overlay file system) than just accessing the underlying
> file system directly and also didn't reveal why a single threaded fuse
> uses less than 100% cpu, with the application on top of use also using
> less than 100% cpu (simple bonnie++ runs with 1B files).
> So I started to suspect the wait queues and indeed, keeping the thread
> that reads the fuse device for work running for some time gives quite
> some improvements.

Might be related: I experimented with wake_up_sync() that didn't meet
my expectations.  See this thread:

https://lore.kernel.org/all/1638780405-38026-1-git-send-email-quic_pragalla@quicinc.com/#r

Possibly fuse needs some wake up tweaks due to its special scheduling
requirements.

Thanks,
Miklos
Bernd Schubert April 22, 2022, 3:46 p.m. UTC | #2
[I removed the failing netapp/zufs CCs]

On 4/22/22 14:25, Miklos Szeredi wrote:
> On Mon, 28 Mar 2022 at 15:21, Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> I would like to discuss the user thread wake up latency in
>> fuse_dev_do_read(). Profiling fuse shows there is room for improvement
>> regarding memory copies and splice. The basic profiling with flame graphs
>> didn't reveal, though, why fuse is so much
>> slower (with an overlay file system) than just accessing the underlying
>> file system directly and also didn't reveal why a single threaded fuse
>> uses less than 100% cpu, with the application on top of use also using
>> less than 100% cpu (simple bonnie++ runs with 1B files).
>> So I started to suspect the wait queues and indeed, keeping the thread
>> that reads the fuse device for work running for some time gives quite
>> some improvements.
> 
> Might be related: I experimented with wake_up_sync() that didn't meet
> my expectations.  See this thread:
> 
> https://lore.kernel.org/all/1638780405-38026-1-git-send-email-quic_pragalla@quicinc.com/#r
> 
> Possibly fuse needs some wake up tweaks due to its special scheduling
> requirements.

Thanks I will look at that as well. I have a patch with spinning and 
avoid of thread wake  that is almost complete and in my (still limited) 
testing almost does not take more CPU and improves meta data / bonnie 
performance in between factor ~1.9 and 3, depending on in which 
performance mode the cpu is.

https://github.com/aakefbs/linux/commits/v5.17-fuse-scheduling3

Missing is just another option for wake-queue-size trigger and handling 
of signals. Should be ready once I'm done with my other work.

That being said, in the mean time I do believe a better approach would 
be SQ/CQ like, similar to NVME or io_uring. In principle exactly as 
io_uring, just the other way around - kernel fills in SQ, user space 
consumes it and fills CQ. We also looked into zufs and your fuse2 branch 
and were almost ready to start to port it to a recent kernel, but it is 
still all systemcall based and has waitq's - probably much slower than 
what could be achieved through queue pairs. Assuming userspace would not 
want a polling thread, but would want a notification similar to 
io_uring_enter(), there would be still a thread needed to be woken up, 
may that is where wake_up_sync() would help.

Btw, the optional kernel polling thread in io_uring also has spinning...


Bernd
Miklos Szeredi April 25, 2022, 8:37 a.m. UTC | #3
On Fri, 22 Apr 2022 at 17:46, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>
> [I removed the failing netapp/zufs CCs]
>
> On 4/22/22 14:25, Miklos Szeredi wrote:
> > On Mon, 28 Mar 2022 at 15:21, Bernd Schubert <bschubert@ddn.com> wrote:
> >>
> >> I would like to discuss the user thread wake up latency in
> >> fuse_dev_do_read(). Profiling fuse shows there is room for improvement
> >> regarding memory copies and splice. The basic profiling with flame graphs
> >> didn't reveal, though, why fuse is so much
> >> slower (with an overlay file system) than just accessing the underlying
> >> file system directly and also didn't reveal why a single threaded fuse
> >> uses less than 100% cpu, with the application on top of use also using
> >> less than 100% cpu (simple bonnie++ runs with 1B files).
> >> So I started to suspect the wait queues and indeed, keeping the thread
> >> that reads the fuse device for work running for some time gives quite
> >> some improvements.
> >
> > Might be related: I experimented with wake_up_sync() that didn't meet
> > my expectations.  See this thread:
> >
> > https://lore.kernel.org/all/1638780405-38026-1-git-send-email-quic_pragalla@quicinc.com/#r
> >
> > Possibly fuse needs some wake up tweaks due to its special scheduling
> > requirements.
>
> Thanks I will look at that as well. I have a patch with spinning and
> avoid of thread wake  that is almost complete and in my (still limited)
> testing almost does not take more CPU and improves meta data / bonnie
> performance in between factor ~1.9 and 3, depending on in which
> performance mode the cpu is.
>
> https://github.com/aakefbs/linux/commits/v5.17-fuse-scheduling3
>
> Missing is just another option for wake-queue-size trigger and handling
> of signals. Should be ready once I'm done with my other work.

Trying to understand what is being optimized here...  does the
following correctly describe your use case?

- an I/O thread is submitting synchronous requests (direct I/O?)

- the fuse thread always goes to sleep, because the request queue is
empty (there's always a single request on the queue)

- with this change the fuse thread spins for a jiffy before going to
sleep, and by that time the I/O thread will submit a new sync request.

- the I/O thread does not spin while the the fuse thread is processing
the request, so it still goes to sleep.

Thanks,
Miklos
Bernd Schubert April 28, 2022, 9:23 p.m. UTC | #4
Sorry for my late reply, I'm on vacation and family visit this week.

On 4/25/22 10:37, Miklos Szeredi wrote:
> On Fri, 22 Apr 2022 at 17:46, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>
>> [I removed the failing netapp/zufs CCs]
>>
>> On 4/22/22 14:25, Miklos Szeredi wrote:
>>> On Mon, 28 Mar 2022 at 15:21, Bernd Schubert <bschubert@ddn.com> wrote:
>>>>
>>>> I would like to discuss the user thread wake up latency in
>>>> fuse_dev_do_read(). Profiling fuse shows there is room for improvement
>>>> regarding memory copies and splice. The basic profiling with flame graphs
>>>> didn't reveal, though, why fuse is so much
>>>> slower (with an overlay file system) than just accessing the underlying
>>>> file system directly and also didn't reveal why a single threaded fuse
>>>> uses less than 100% cpu, with the application on top of use also using
>>>> less than 100% cpu (simple bonnie++ runs with 1B files).
>>>> So I started to suspect the wait queues and indeed, keeping the thread
>>>> that reads the fuse device for work running for some time gives quite
>>>> some improvements.
>>>
>>> Might be related: I experimented with wake_up_sync() that didn't meet
>>> my expectations.  See this thread:
>>>
>>> https://lore.kernel.org/all/1638780405-38026-1-git-send-email-quic_pragalla@quicinc.com/#r
>>>
>>> Possibly fuse needs some wake up tweaks due to its special scheduling
>>> requirements.
>>
>> Thanks I will look at that as well. I have a patch with spinning and
>> avoid of thread wake  that is almost complete and in my (still limited)
>> testing almost does not take more CPU and improves meta data / bonnie
>> performance in between factor ~1.9 and 3, depending on in which
>> performance mode the cpu is.
>>
>> https://github.com/aakefbs/linux/commits/v5.17-fuse-scheduling3
>>
>> Missing is just another option for wake-queue-size trigger and handling
>> of signals. Should be ready once I'm done with my other work.
> 
> Trying to understand what is being optimized here...  does the
> following correctly describe your use case?
> 
> - an I/O thread is submitting synchronous requests (direct I/O?)
> 
> - the fuse thread always goes to sleep, because the request queue is
> empty (there's always a single request on the queue)
> 
> - with this change the fuse thread spins for a jiffy before going to
> sleep, and by that time the I/O thread will submit a new sync request.
> 
> - the I/O thread does not spin while the the fuse thread is processing
> the request, so it still goes to sleep.


Yes, this describes it well. We basically noticed weird effects with 
multiple fuse threads when you had asked for benchmarks of the atomic 
create/open patches. In our HPC world the standard for such benchmarks 
is to use mdtest, but for simplicity I personally prefer bonnie++, like 
"bonnie++ -s0 -n10:1:1:10 -d <dest-path>"

Initial results were rather confusing, as reduced number of requests 
could result in lower performance. So I started to investigate and found 
a number of issues

1) passthrough_ll is using a single linked list to store inodes, we 
later switched to passthrough_hp which uses a C++ map to avoid the O(N) 
inode search

2) limiting the number of threads in libfuse using the max_idle_threads 
variable caused additional high cpu usage - there was permanent pthread 
creation/destruction. I have submitted patches for that (additional 
difficulty is to fix the API to avoid uninitialized struct members in 
libfuse3)

3) There is some overhead with splice for small requests like meta data. 
Even though the libfuse already tries to use splice for larger requests 
only. But unless disabled it still does a splice system call for the 
request header - enough to introduce a perf penalty. I have some very 
experimental patches for that as well, although it got much more 
difficult than I had initially hoped for. With these patches applied I 
started to profile the system with flame graphs and noticed that 
performance is much lower than it could be explained by the fuse cpu 
overhead.

4) Figured out about the waitq overhead. In the mean time I'm rather 
surprised about the zufs results - benchmarks had been done with at 
least  n-application thread >= 2 x n-zufs threads? Using thread spinning 
might avoid the issue, but with request queue per core in worst case all 
system cores might go a bit into spinning mode - at least not idea for 
embedded systems. And also not ideal for power consumption on laptops or 
phones and neither for HPC systems where systems are supposed to be busy 
to do calculations.

4.1) A sub problem of the waitq is the sleep condition - it checks if 
there are no pending requests - threads on different cores randomly wake 
up, even with avoided explicit thread wake as in my patches.

Right now I'm at a point where I see that my patches help to improve 
performance, but I'm not totally happy with the solution myself. That is 
basically where I believe that a SQ/CQ approach would give better 
performance and should/might avoid additional complexity.  At a minimum 
the request queue (SQ) spinning could be totally controlled/handled in 
user space.
Just need to find the time to code it...


Bernd
diff mbox series

Patch

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 592730fd6e42..20b7cf296fb0 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1034,7 +1034,7 @@  static int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
  
  static int forget_pending(struct fuse_iqueue *fiq)
  {
-       return fiq->forget_list_head.next != NULL;
+       return READ_ONCE(fiq->forget_list_head.next) != NULL;
  }
  
  static int request_pending(struct fuse_iqueue *fiq)
@@ -1237,18 +1237,25 @@  static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
                 return -EINVAL;
  
   restart:
+       expires = jiffies + 10;
         for (;;) {
-               spin_lock(&fiq->lock);
-               if (!fiq->connected || request_pending(fiq))
-                       break;
-               spin_unlock(&fiq->lock);
  
+               if (!READ_ONCE(fiq->connected) || request_pending(fiq)) {
+                       spin_lock(&fiq->lock);
+                       if (!fiq->connected || request_pending(fiq))
+                               break;
+                       spin_unlock(&fiq->lock);
+               }
                 if (file->f_flags & O_NONBLOCK)
                         return -EAGAIN;
-               err = wait_event_interruptible_exclusive(fiq->waitq,
-                               !fiq->connected || request_pending(fiq));
+
+               if (time_after_eq(jiffies, expires))
+                       err = wait_event_interruptible_exclusive(fiq->waitq,
+                                       !fiq->connected || request_pending(fiq));
                 if (err)
                         return err;
+
+               cond_resched();
         }
  
         if (!fiq->connected) {