Message ID | 87efto69rs.fsf@notabene.neil.brown.name (mailing list archive) |
---|---|
State | New, archived |
On Mon, Jul 10, 2017 at 05:17:27PM +1000, NeilBrown wrote:
> On Fri, Jul 07 2017, Phil Kauffman wrote:
>
> >> I can imagine /etc/mtab being read once for every line in /etc/exports,
> >> but unless your /etc/exports is very big, I can't easily see why it
> >> would be read 900 times.
> >> Maybe lots of different mount points are being accessed by something and
> >> each one triggers a few reads...
> >>
> >> Can you show your /etc/exports file?
> >
> > These get generated via puppet and dropped into /etc/exports.d/
> ...
> >
> > Here we can see that there are 960 files in /etc/exports.d
> > root@storage1:~# find /etc/exports.d/ -type f | wc
> >   960   960   88701
>
> Ahhh.. 960 exports.  That could do it.
>
>   $ grep -c crossmnt all_etc_exports.d_in_one_file.txt
>   957
>
> I get 957 - much the same number.
>
>   $ grep open redacted_strace.txt | uniq -c
>       2 open("/var/lib/nfs/etab", O_RDONLY) = 6
>     942 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 6
>       1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
>
> /etc/mtab was read 942 times for a single access.  I would have expected 957,
> but maybe the system is dynamic and something changed between the two
> samples.
>
> This makes it fairly clear what is happening.  Now we just need to fix
> it.
> One option would be to cache some of the details extracted from
> /etc/mtab, but that could get messy.

Would it be possible to read /proc/self/mounts only once and use poll()
to detect changes?

The issue is not only the open()+read() in the process, but also that the
kernel has to compose the file from the mount tree. It's expensive.

    Karel
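A minimal sketch of that approach, for illustration only (this is not nfs-utils code): keep one fd open on /proc/self/mounts, poll() it, and refresh a cached snapshot only when the kernel reports a change. Procfs mount files signal a change as POLLERR/POLLPRI; the fixed-size buffer and snapshot handling below are purely illustrative.

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static char *mounts_snapshot;	/* cached copy of /proc/self/mounts */

/* Re-read the file into the snapshot; procfs allows seeking back to 0.
 * A real implementation would read in a loop until EOF instead of
 * assuming one fixed-size read is enough. */
static int reload_mounts(int fd)
{
	static char buf[65536];
	ssize_t n;

	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	free(mounts_snapshot);
	mounts_snapshot = strdup(buf);
	return mounts_snapshot ? 0 : -1;
}

int main(void)
{
	int fd = open("/proc/self/mounts", O_RDONLY);
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };

	if (fd < 0 || reload_mounts(fd) < 0)
		return 1;

	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		if (pfd.revents & (POLLERR | POLLPRI)) {
			/* mount table changed: refresh the cached copy */
			if (reload_mounts(fd) < 0)
				break;
			printf("mount table changed; snapshot refreshed\n");
		}
	}
	close(fd);
	return 0;
}

The export-matching code would then parse mounts_snapshot (a name invented here) instead of opening /etc/mtab once per export.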
On 07/10/2017 02:17 AM, NeilBrown wrote:
> Another is to move the code around.  In your case there are really just
> 3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
> the combined exports file is really different in different files).
> So any one client only needs to consider 3 exports, not 300.
This assumption is correct. We have ~300 clients and growing (slowly).

> Could you test with this patch applied and see what
> difference it makes?
After being confused by conflicting test results and determining that the service nfs-server takes a bit longer to start than 'systemctl' will let you believe, I believe Neil's patch works.

The new cache.c file I am using to test with: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/cache.c
It also contains the official patch from this thread: http://marc.info/?l=linux-nfs&m=138900709311582&w=2

Building the new deb packages: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/build_deb.txt

Install nfs-kernel-server and nfs-common debs on server and nfs-common on client.

I reboot everything: client and server.

Run ssh for loop (no stracing) with the result that every user was able to ssh in under a second (I consider this to be a success).
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_result_no-strace.txt

I then used a larger sample size of users (~/f) and got a similar result.

Now let's strace rpc.mountd while sshing with one user:

$ time ssh -o BatchMode=yes $CLIENTHOSTNAME -l user_not_in_list_f
Permission denied (publickey,password).

real    0m0.352s
user    0m0.020s
sys     0m0.012s

The resulting strace seems entirely reasonable: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_one_user_strace.txt

$ for user in $(cat ~/users | grep -E "^g*"); do time ssh -o BatchMode=yes $CLIENTHOSTNAME -l $user; done

The first few dozen users whose usernames happen to start with 'g':
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_multiple_user_strace.txt
On Mon, Jul 10 2017, Karel Zak wrote:

> On Mon, Jul 10, 2017 at 05:17:27PM +1000, NeilBrown wrote:
>> On Fri, Jul 07 2017, Phil Kauffman wrote:
>>
>> >> I can imagine /etc/mtab being read once for every line in /etc/exports,
>> >> but unless your /etc/exports is very big, I can't easily see why it
>> >> would be read 900 times.
>> >> Maybe lots of different mount points are being accessed by something and
>> >> each one triggers a few reads...
>> >>
>> >> Can you show your /etc/exports file?
>> >
>> > These get generated via puppet and dropped into /etc/exports.d/
>> ...
>> >
>> > Here we can see that there are 960 files in /etc/exports.d
>> > root@storage1:~# find /etc/exports.d/ -type f | wc
>> >   960   960   88701
>>
>> Ahhh.. 960 exports.  That could do it.
>>
>>   $ grep -c crossmnt all_etc_exports.d_in_one_file.txt
>>   957
>>
>> I get 957 - much the same number.
>>
>>   $ grep open redacted_strace.txt | uniq -c
>>       2 open("/var/lib/nfs/etab", O_RDONLY) = 6
>>     942 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 6
>>       1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
>>
>> /etc/mtab was read 942 times for a single access.  I would have expected 957,
>> but maybe the system is dynamic and something changed between the two
>> samples.
>>
>> This makes it fairly clear what is happening.  Now we just need to fix
>> it.
>> One option would be to cache some of the details extracted from
>> /etc/mtab, but that could get messy.
>
> Would it be possible to read /proc/self/mounts only once and use poll()
> to detect changes?

Sure it is possible.  All we need is code....

>
> The issue is not only the open()+read() in the process, but also that the
> kernel has to compose the file from the mount tree. It's expensive.

That certainly contributes to the problem.
If we cache the content, we can process it fast.
If we avoid looking at the content at all, it doesn't matter how fast
processing it is (because we don't).

So yes, that might be part of the solution.  But not the whole solution.

Thanks,
NeilBrown

>
> Karel
>
> --
>  Karel Zak  <kzak@redhat.com>
>  http://karelzak.blogspot.com
On Mon, Jul 10 2017, Phil Kauffman wrote:

> On 07/10/2017 02:17 AM, NeilBrown wrote:
>> Another is to move the code around.  In your case there are really just
>> 3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
>> the combined exports file is really different in different files).
>> So any one client only needs to consider 3 exports, not 300.
> This assumption is correct. We have ~300 clients and growing (slowly).
>
>> Could you test with this patch applied and see what
>> difference it makes?
> After being confused by conflicting test results and determining that the service nfs-server takes a bit longer to start than 'systemctl' will let you believe, I believe Neil's patch works.
>
> The new cache.c file I am using to test with: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/cache.c
> It also contains the official patch from this thread: http://marc.info/?l=linux-nfs&m=138900709311582&w=2
>
> Building the new deb packages: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/build_deb.txt
>
> Install nfs-kernel-server and nfs-common debs on server and nfs-common on client.
>
> I reboot everything: client and server.
>
> Run ssh for loop (no stracing) with the result that every user was able to ssh in under a second (I consider this to be a success).

This does look encouraging ... but I'm not sure we are comparing apples
with apples.

In the original strace we see:

  read(4, ....)          check on etab
  statfs(some path)
  write(4,....

which is exactly like the new strace.  But then the old strace also has:

  read(5, ...)           check on etab
  read /etc/mtab hundreds of times
  write(5, ...)

which is completely missing in the new strace.  So the new code, which
doesn't read /etc/mtab as much, isn't being run at all.

File descriptor 4 is /proc/1926/net/rpc/nfsd.export/channel
File descriptor 5 is /proc/1926/net/rpc/nfsd.fh/channel

If the nfs server kernel has nothing cached, then when a request arrives
you should first see a transaction on fd 3
(/proc/1926/net/rpc/auth.unix.ip/channel) to map the IP address to a
"domain".
Then a transaction on fd 5 (nfsd.fh) to map the domain + a filehandle
fragment to an export point.
Then a transaction on fd 4 (nfsd.export) to map the domain + export point
to some export options.

If the request is a "LOOKUP", you might see another "fd 5" lookup after
the "fd 4", to find out if a sub-mountpoint is exported.

That is what the old strace showed.  If you don't see all of those, then
something is cached.

Could you repeat your experiments after first running "exportfs -f" on
the nfs server?  That should cause worst-case load on mountd.

Until we see the "read(5).... write(5)" sequence completing quickly, I
cannot be sure that we have fixed anything.

Actually.. for future traces, please add "-tt" so I can see where the
delays are, and how much they are.

(and thanks for providing all the tracing details - I love getting
unambiguous data!!)

Thanks,
NeilBrown
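For reference, the transaction pattern on those channel files looks roughly like the sketch below. This is only an illustration of the upcall/downcall flow described above: the real rpc.mountd multiplexes several channels with select() and encodes requests and replies with its qword_* helpers, all of which is omitted here, and the channel path is just the one seen in the trace.

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

int main(void)
{
	/* e.g. /proc/net/rpc/nfsd.export/channel -- "fd 4" in the trace */
	int fd = open("/proc/net/rpc/nfsd.export/channel", O_RDWR);
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	char req[8192];

	if (fd < 0)
		return 1;
	while (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
		/* the kernel's upcall arrives as a request line to read */
		ssize_t n = read(fd, req, sizeof(req) - 1);
		if (n <= 0)
			continue;
		req[n] = '\0';
		/* ... parse the request, look up the export, encode a reply ... */
		/* write(fd, reply, reply_len);   the downcall back to the kernel */
	}
	close(fd);
	return 0;
}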
On 07/10/2017 06:51 PM, NeilBrown wrote:
> This does look encouraging ... but I'm not sure we are comparing apples
> with apples.
I had a feeling this was the case.

> Could you repeat your experiments after first running "exportfs -f" on
> the nfs server?  That should cause worst-case load on mountd.
The new strace with '-tt': http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/2017-07-11_100929_strace.txt
SSH test: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/ssh_output.txt

nfsserver# exportfs -f; m=$(pgrep rpc.mountd); strace -tt -p ${m} 2>&1 | tee `date +%F_%T`_strace.txt

# cut -d' ' -f2 2017-07-11_100929_strace.txt | cut -d'(' -f1 | sort | uniq -c | sort -n
      1 Process
     73 select
    143 write
    216 close
    216 fstat
    216 open
  48240 read
 435599 stat
 871162 statfs
 871163 lstat

> (and thanks for providing all the tracing details - I love getting
> unambiguous data!!)
No problem, I'm available to run any necessary tests. I also appreciate your help on this.
On Tue, Jul 11 2017, Phil Kauffman wrote:

> On 07/10/2017 06:51 PM, NeilBrown wrote:
>> This does look encouraging ... but I'm not sure we are comparing apples
>> with apples.
> I had a feeling this was the case.
>
>> Could you repeat your experiments after first running "exportfs -f" on
>> the nfs server?  That should cause worst-case load on mountd.
> The new strace with '-tt': http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/2017-07-11_100929_strace.txt
> SSH test: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/ssh_output.txt
>
> nfsserver# exportfs -f; m=$(pgrep rpc.mountd); strace -tt -p ${m} 2>&1 | tee `date +%F_%T`_strace.txt
>
> # cut -d' ' -f2 2017-07-11_100929_strace.txt | cut -d'(' -f1 | sort | uniq -c | sort -n
>       1 Process
>      73 select
>     143 write
>     216 close
>     216 fstat
>     216 open
>   48240 read
>  435599 stat
>  871162 statfs
>  871163 lstat
>
>> (and thanks for providing all the tracing details - I love getting
>> unambiguous data!!)
> No problem, I'm available to run any necessary tests. I also appreciate your help on this.
>

So the new data shows about 7 seconds for a login, which is probably
a little longer than you would like, but might be acceptable?

Each login triggers 3 reads of /etc/mtab, and each read (from 'open' to
'close') takes either 3-4 seconds or about 0.2 seconds.

Actually reading /etc/mtab seems to take about 0.1-0.2 seconds at most.
The rest of the time is calling stat, with a pattern like

10:09:39.547280 stat("/tank/homes/someuser", {st_mode=S_IFDIR|0750, st_size=52, ...}) = 0
10:09:39.547318 lstat("/tank/homes/someuser", {st_mode=S_IFDIR|0750, st_size=52, ...}) = 0
10:09:39.547350 lstat("/tank/homes/someuser/..", {st_mode=S_IFDIR|0755, st_size=5967, ...}) = 0
10:09:39.547385 statfs("/tank/homes/someuser", {f_type=0x2fc12fc1, f_bsize=131072, f_blocks=40960, f_bfree=38830, f_ba
10:09:39.547424 statfs("/tank/homes/someuser", {f_type=0x2fc12fc1, f_bsize=131072, f_blocks=40960, f_bfree=38830, f_ba

Of the three passes through /etc/mtab for each login, one generates 95 of
these, one generates 6049, and one generates 5952.

The 'stat' is from match_fsid().
The 2 'lstat' are from is_mountpoint() (called by match_fsid()).
The first statfs() is from uuid_by_path().
Maybe the second is somewhere in the blkid code.

Probably the next step is to cache the mapping from
  path, fsidtype -> fsid.
That is more than a 2-line patch.  I might have a go later this week.

NeilBrown
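A rough sketch of the kind of cache suggested above, keyed on (path, fsidtype). The structure and function names here are hypothetical, not taken from nfs-utils, and a real implementation would also have to invalidate entries when the mount table changes (for example, off the poll() notification discussed earlier in the thread).

#include <stdlib.h>
#include <string.h>

struct fsid_cache_entry {
	char *path;			/* export path the fsid was computed for */
	int fsidtype;
	char fsid[32];			/* cached opaque fsid bytes */
	size_t fsid_len;
	struct fsid_cache_entry *next;
};

static struct fsid_cache_entry *fsid_cache;

/* Return a previously computed fsid, or NULL on a cache miss. */
static struct fsid_cache_entry *fsid_cache_lookup(const char *path, int fsidtype)
{
	struct fsid_cache_entry *e;

	for (e = fsid_cache; e; e = e->next)
		if (e->fsidtype == fsidtype && strcmp(e->path, path) == 0)
			return e;
	return NULL;
}

/* Remember a freshly computed fsid so later lookups can skip the
 * stat()/lstat()/statfs() calls seen in the trace. */
static void fsid_cache_store(const char *path, int fsidtype,
			     const char *fsid, size_t fsid_len)
{
	struct fsid_cache_entry *e = calloc(1, sizeof(*e));

	if (!e || fsid_len > sizeof(e->fsid)) {
		free(e);
		return;
	}
	e->path = strdup(path);
	e->fsidtype = fsidtype;
	memcpy(e->fsid, fsid, fsid_len);
	e->fsid_len = fsid_len;
	e->next = fsid_cache;
	fsid_cache = e;
}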
On 07/11/2017 07:46 PM, NeilBrown wrote:
> So the new data shows about 7 seconds for a login, which is probably
> a little longer than you would like, but might be acceptable?
Unfortunately, the delay is not acceptable. The ideal would be to achieve performance parity with when one is not forced to use the 'crossmnt' option.

My current setup (home directory server) does not require 'crossmnt' and does not incur a delay. It is a standard nfs server using mdadm, lvm, and xfs.

While my current setup is probably "normal" and using nested datasets with the 'crossmnt' option might be "weird" now, nested mounts will probably only become more common as BTRFS, ZFS, and other filesystems with similar features gain traction on Linux.

> That is more than a 2-line patch.  I might have a go later this week.
That would be super! Thanks for taking a look.
diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
index ca6c84f4d93d..e712cc166157 100644
--- a/utils/mountd/cache.c
+++ b/utils/mountd/cache.c
@@ -727,6 +727,17 @@ static void nfsd_fh(int f)
 		for (exp = exportlist[i].p_head; exp; exp = next_exp) {
 			char *path;
 
+			if (!is_ipaddr_client(dom)
+					&& !namelist_client_matches(exp, dom)) {
+				next_exp = exp->m_next;
+				continue;
+			}
+			if (is_ipaddr_client(dom)
+					&& !ipaddr_client_matches(exp, ai)) {
+				next_exp = exp->m_next;
+				continue;
+			}
+
 			if (exp->m_export.e_flags & NFSEXP_CROSSMOUNT) {
 				static nfs_export *prev = NULL;
 				static void *mnt = NULL;
@@ -751,9 +762,6 @@ static void nfsd_fh(int f)
 				next_exp = exp->m_next;
 			}
 
-			if (!is_ipaddr_client(dom)
-					&& !namelist_client_matches(exp, dom))
-				continue;
 			if (exp->m_export.e_mountpoint &&
 			    !is_mountpoint(exp->m_export.e_mountpoint[0]?
 					   exp->m_export.e_mountpoint:
@@ -762,9 +770,6 @@
 			if (!match_fsid(&parsed, exp, path))
 				continue;
-			if (is_ipaddr_client(dom)
-					&& !ipaddr_client_matches(exp, ai))
-				continue;
 			if (!found || subexport(&exp->m_export, found)) {
 				found = &exp->m_export;
 				free(found_path);