Message ID | 87efto69rs.fsf@notabene.neil.brown.name (mailing list archive) |
---|---|
State | New, archived |
On Mon, Jul 10, 2017 at 05:17:27PM +1000, NeilBrown wrote:
> On Fri, Jul 07 2017, Phil Kauffman wrote:
>
> >> I can imagine /etc/mtab being read once for every line in /etc/exports,
> >> but unless your /etc/exports is very big, I can't easily see why it
> >> would be read 900 times.
> >> Maybe lots of different mount points are being accessed by something and
> >> each one triggers a few reads...
> >>
> >> Can you show your /etc/exports file?
> >
> > These get generated via puppet and dropped into /etc/exports.d/
> ...
> >
> > Here we can see that there are 960 files in /etc/exports.d
> > root@storage1:~# find /etc/exports.d/ -type f | wc
> >   960   960   88701
>
> Ahhh.. 960 exports.  That could do it.
>
>   $ grep -c crossmnt all_etc_exports.d_in_one_file.txt
>   957
>
> I get 957 - much the same number.
>
>   $ grep open redacted_strace.txt | uniq -c
>       2 open("/var/lib/nfs/etab", O_RDONLY) = 6
>     942 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 6
>       1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
>
> /etc/mtab was read 942 times for a single access.  I would have expected 957,
> but maybe the system is dynamic and something changed between the two
> samples.
>
> This makes it fairly clear what is happening.  Now we just need to fix
> it.
> One option would be to cache some of the details extracted from
> /etc/mtab, but that could get messy.

Would it be possible to read /proc/self/mounts only once and use poll()
to detect changes?

The issue is not only the open()+read() in the process, but also that the
kernel has to compose the file from the mount tree. It's expensive.

    Karel
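A minimal sketch of that approach, for illustration only (this is not nfs-utils code): keep one fd open on /proc/self/mounts, poll() it, and refresh a cached snapshot only when the kernel reports a change. Procfs mount files signal a change as POLLERR/POLLPRI; the fixed-size buffer and snapshot handling below are purely illustrative.

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static char *mounts_snapshot;	/* cached copy of /proc/self/mounts */

/* Re-read the file into the snapshot; procfs allows seeking back to 0.
 * A real implementation would read in a loop until EOF instead of
 * assuming one fixed-size read is enough. */
static int reload_mounts(int fd)
{
	static char buf[65536];
	ssize_t n;

	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	free(mounts_snapshot);
	mounts_snapshot = strdup(buf);
	return mounts_snapshot ? 0 : -1;
}

int main(void)
{
	int fd = open("/proc/self/mounts", O_RDONLY);
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };

	if (fd < 0 || reload_mounts(fd) < 0)
		return 1;

	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		if (pfd.revents & (POLLERR | POLLPRI)) {
			/* mount table changed: refresh the cached copy */
			if (reload_mounts(fd) < 0)
				break;
			printf("mount table changed; snapshot refreshed\n");
		}
	}
	close(fd);
	return 0;
}

The export-matching code would then parse mounts_snapshot (a name invented here) instead of opening /etc/mtab once per export.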
On 07/10/2017 02:17 AM, NeilBrown wrote:
> Another is to move the code around.  In your case there are really just
> 3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
> the combined exports file is really different in different files).
> So any one client only needs to consider 3 exports, not 300.
This assumption is correct. We have ~300 clients and growing (slowly).

> Could you test with this patch applied and see what
> difference it makes?
After being confused by conflicting test results and determining that the service nfs-server takes a bit longer to start than 'systemctl' will let you believe, I believe Neil's patch works.

The new cache.c file I am using to test with: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/cache.c
It also contains the official patch from this thread: http://marc.info/?l=linux-nfs&m=138900709311582&w=2

Building the new deb packages: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/build_deb.txt

Install nfs-kernel-server and nfs-common debs on server and nfs-common on client.

I reboot everything: client and server.

Run ssh for loop (no stracing) with the result that every user was able to ssh in under a second (I consider this to be a success).
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_result_no-strace.txt

I then used a larger sample size of users (~/f) and got a similar result.

Now let's strace rpc.mountd while sshing with one user:

$ time ssh -o BatchMode=yes $CLIENTHOSTNAME -l user_not_in_list_f
Permission denied (publickey,password).

real    0m0.352s
user    0m0.020s
sys     0m0.012s

The resulting strace seems entirely reasonable: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_one_user_strace.txt

$ for user in $(cat ~/users | grep -E "^g*"); do time ssh -o BatchMode=yes $CLIENTHOSTNAME -l $user; done

The first few dozen users whose usernames happen to start with 'g':
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_multiple_user_strace.txt
On Mon, Jul 10 2017, Karel Zak wrote:

> On Mon, Jul 10, 2017 at 05:17:27PM +1000, NeilBrown wrote:
>> On Fri, Jul 07 2017, Phil Kauffman wrote:
>>
>> >> I can imagine /etc/mtab being read once for every line in /etc/exports,
>> >> but unless your /etc/exports is very big, I can't easily see why it
>> >> would be read 900 times.
>> >> Maybe lots of different mount points are being accessed by something and
>> >> each one triggers a few reads...
>> >>
>> >> Can you show your /etc/exports file?
>> >
>> > These get generated via puppet and dropped into /etc/exports.d/
>> ...
>> >
>> > Here we can see that there are 960 files in /etc/exports.d
>> > root@storage1:~# find /etc/exports.d/ -type f | wc
>> >   960   960   88701
>>
>> Ahhh.. 960 exports.  That could do it.
>>
>>   $ grep -c crossmnt all_etc_exports.d_in_one_file.txt
>>   957
>>
>> I get 957 - much the same number.
>>
>>   $ grep open redacted_strace.txt | uniq -c
>>       2 open("/var/lib/nfs/etab", O_RDONLY) = 6
>>     942 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 6
>>       1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
>>
>> /etc/mtab was read 942 times for a single access.  I would have expected 957,
>> but maybe the system is dynamic and something changed between the two
>> samples.
>>
>> This makes it fairly clear what is happening.  Now we just need to fix
>> it.
>> One option would be to cache some of the details extracted from
>> /etc/mtab, but that could get messy.
>
> Would it be possible to read /proc/self/mounts only once and use poll()
> to detect changes?

Sure it is possible.  All we need is code....

>
> The issue is not only the open()+read() in the process, but also that the
> kernel has to compose the file from the mount tree. It's expensive.

That certainly contributes to the problem.
If we cache the content, we can process it fast.
If we avoid looking at the content at all, it doesn't matter how fast
processing it is (because we don't).

So yes, that might be part of the solution.  But not the whole solution.

Thanks,
NeilBrown

>
> Karel
>
> --
>  Karel Zak  <kzak@redhat.com>
>  http://karelzak.blogspot.com
On Mon, Jul 10 2017, Phil Kauffman wrote:

> On 07/10/2017 02:17 AM, NeilBrown wrote:
>> Another is to move the code around.  In your case there are really just
>> 3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
>> the combined exports file is really different in different files).
>> So any one client only needs to consider 3 exports, not 300.
> This assumption is correct. We have ~300 clients and growing (slowly).
>
>> Could you test with this patch applied and see what
>> difference it makes?
> After being confused by conflicting test results and determining that the service nfs-server takes a bit longer to start than 'systemctl' will let you believe, I believe Neil's patch works.
>
> The new cache.c file I am using to test with: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/cache.c
> It also contains the official patch from this thread: http://marc.info/?l=linux-nfs&m=138900709311582&w=2
>
> Building the new deb packages: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/build_deb.txt
>
> Install nfs-kernel-server and nfs-common debs on server and nfs-common on client.
>
> I reboot everything: client and server.
>
> Run ssh for loop (no stracing) with the result that every user was able to ssh in under a second (I consider this to be a success).

This does look encouraging ... but I'm not sure we are comparing apples
with apples.

In the original strace we see:

  read(4, ....)          check on etab
  statfs(some path)
  write(4,....

which is exactly like the new strace.  But then the old strace also has:

  read(5, ...)           check on etab
  read /etc/mtab hundreds of times
  write(5, ...)

which is completely missing in the new strace.  So the new code, which
doesn't read /etc/mtab as much, isn't being run at all.

File descriptor 4 is /proc/1926/net/rpc/nfsd.export/channel
File descriptor 5 is /proc/1926/net/rpc/nfsd.fh/channel

If the nfs server kernel has nothing cached, then when a request arrives
you should first see a transaction on fd 3
(/proc/1926/net/rpc/auth.unix.ip/channel) to map the IP address to a
"domain".
Then a transaction on fd 5 (nfsd.fh) to map the domain + a filehandle
fragment to an export point.
Then a transaction on fd 4 (nfsd.export) to map the domain + export point
to some export options.

If the request is a "LOOKUP", you might see another "fd 5" lookup after
the "fd 4", to find out if a sub-mountpoint is exported.

That is what the old strace showed.  If you don't see all of those, then
something is cached.

Could you repeat your experiments after first running "exportfs -f" on
the nfs server?  That should cause worst-case load on mountd.

Until we see the "read(5).... write(5)" sequence completing quickly, I
cannot be sure that we have fixed anything.

Actually.. for future traces, please add "-tt" so I can see where the
delays are, and how much they are.

(and thanks for providing all the tracing details - I love getting
unambiguous data!!)

Thanks,
NeilBrown
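For reference, the transaction pattern on those channel files looks roughly like the sketch below. This is only an illustration of the upcall/downcall flow described above: the real rpc.mountd multiplexes several channels with select() and encodes requests and replies with its qword_* helpers, all of which is omitted here, and the channel path is just the one seen in the trace.

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

int main(void)
{
	/* e.g. /proc/net/rpc/nfsd.export/channel -- "fd 4" in the trace */
	int fd = open("/proc/net/rpc/nfsd.export/channel", O_RDWR);
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	char req[8192];

	if (fd < 0)
		return 1;
	while (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
		/* the kernel's upcall arrives as a request line to read */
		ssize_t n = read(fd, req, sizeof(req) - 1);
		if (n <= 0)
			continue;
		req[n] = '\0';
		/* ... parse the request, look up the export, encode a reply ... */
		/* write(fd, reply, reply_len);   the downcall back to the kernel */
	}
	close(fd);
	return 0;
}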
On 07/10/2017 06:51 PM, NeilBrown wrote:
> This does look encouraging ... but I'm not sure we are comparing apples
> with apples.
I had a feeling this was the case.

> Could you repeat your experiments after first running "exportfs -f" on
> the nfs server?  That should cause worst-case load on mountd.
The new strace with '-tt': http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/2017-07-11_100929_strace.txt
SSH test: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/ssh_output.txt

nfsserver# exportfs -f; m=$(pgrep rpc.mountd); strace -tt -p ${m} 2>&1 | tee `date +%F_%T`_strace.txt

# cut -d' ' -f2 2017-07-11_100929_strace.txt | cut -d'(' -f1 | sort | uniq -c | sort -n
      1 Process
     73 select
    143 write
    216 close
    216 fstat
    216 open
  48240 read
 435599 stat
 871162 statfs
 871163 lstat

> (and thanks for providing all the tracing details - I love getting
> unambiguous data!!)
No problem, I'm available to run any necessary tests. I also appreciate your help on this.
On Tue, Jul 11 2017, Phil Kauffman wrote:

> On 07/10/2017 06:51 PM, NeilBrown wrote:
>> This does look encouraging ... but I'm not sure we are comparing apples
>> with apples.
> I had a feeling this was the case.
>
>> Could you repeat your experiments after first running "exportfs -f" on
>> the nfs server?  That should cause worst-case load on mountd.
> The new strace with '-tt': http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/2017-07-11_100929_strace.txt
> SSH test: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/ssh_output.txt
>
> nfsserver# exportfs -f; m=$(pgrep rpc.mountd); strace -tt -p ${m} 2>&1 | tee `date +%F_%T`_strace.txt
>
> # cut -d' ' -f2 2017-07-11_100929_strace.txt | cut -d'(' -f1 | sort | uniq -c | sort -n
>       1 Process
>      73 select
>     143 write
>     216 close
>     216 fstat
>     216 open
>   48240 read
>  435599 stat
>  871162 statfs
>  871163 lstat
>
>> (and thanks for providing all the tracing details - I love getting
>> unambiguous data!!)
> No problem, I'm available to run any necessary tests. I also appreciate your help on this.
>

So the new data shows about 7 seconds for a login, which is probably
a little longer than you would like, but might be acceptable?

Each login triggers 3 reads of /etc/mtab, and each read (from 'open' to
'close') takes either 3-4 seconds or about 0.2 seconds.

Actually reading /etc/mtab seems to take about 0.1-0.2 seconds at most.
The rest of the time is calling stat, with a pattern like

10:09:39.547280 stat("/tank/homes/someuser", {st_mode=S_IFDIR|0750, st_size=52, ...}) = 0
10:09:39.547318 lstat("/tank/homes/someuser", {st_mode=S_IFDIR|0750, st_size=52, ...}) = 0
10:09:39.547350 lstat("/tank/homes/someuser/..", {st_mode=S_IFDIR|0755, st_size=5967, ...}) = 0
10:09:39.547385 statfs("/tank/homes/someuser", {f_type=0x2fc12fc1, f_bsize=131072, f_blocks=40960, f_bfree=38830, f_ba
10:09:39.547424 statfs("/tank/homes/someuser", {f_type=0x2fc12fc1, f_bsize=131072, f_blocks=40960, f_bfree=38830, f_ba

Of the three passes through /etc/mtab for each login, one generates 95 of
these, one generates 6049, and one generates 5952.

The 'stat' is from match_fsid().
The 2 'lstat' are from is_mountpoint() (called by match_fsid()).
The first statfs() is from uuid_by_path().
Maybe the second is somewhere in the blkid code.

Probably the next step is to cache the mapping from
  path, fsidtype -> fsid.
That is more than a 2-line patch.  I might have a go later this week.

NeilBrown
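A rough sketch of the kind of cache suggested above, keyed on (path, fsidtype). The structure and function names here are hypothetical, not taken from nfs-utils, and a real implementation would also have to invalidate entries when the mount table changes (for example, off the poll() notification discussed earlier in the thread).

#include <stdlib.h>
#include <string.h>

struct fsid_cache_entry {
	char *path;			/* export path the fsid was computed for */
	int fsidtype;
	char fsid[32];			/* cached opaque fsid bytes */
	size_t fsid_len;
	struct fsid_cache_entry *next;
};

static struct fsid_cache_entry *fsid_cache;

/* Return a previously computed fsid, or NULL on a cache miss. */
static struct fsid_cache_entry *fsid_cache_lookup(const char *path, int fsidtype)
{
	struct fsid_cache_entry *e;

	for (e = fsid_cache; e; e = e->next)
		if (e->fsidtype == fsidtype && strcmp(e->path, path) == 0)
			return e;
	return NULL;
}

/* Remember a freshly computed fsid so later lookups can skip the
 * stat()/lstat()/statfs() calls seen in the trace. */
static void fsid_cache_store(const char *path, int fsidtype,
			     const char *fsid, size_t fsid_len)
{
	struct fsid_cache_entry *e = calloc(1, sizeof(*e));

	if (!e || fsid_len > sizeof(e->fsid)) {
		free(e);
		return;
	}
	e->path = strdup(path);
	e->fsidtype = fsidtype;
	memcpy(e->fsid, fsid, fsid_len);
	e->fsid_len = fsid_len;
	e->next = fsid_cache;
	fsid_cache = e;
}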
On 07/11/2017 07:46 PM, NeilBrown wrote:
> So the new data shows about 7 seconds for a login, which is probably
> a little longer than you would like, but might be acceptable?
Unfortunately, the delay is not acceptable. The ideal would be to achieve performance parity with when one is not forced to use the 'crossmnt' option.

My current setup (home directory server) does not require 'crossmnt' and does not incur a delay. It is a standard nfs server using mdadm, lvm, and xfs.

While my current setup is probably "normal" and using nested datasets with the 'crossmnt' option might be "weird" now, nested mounts will probably only become more common as BTRFS, ZFS, and other filesystems with similar features gain traction on Linux.

> That is more than a 2-line patch.  I might have a go later this week.
That would be super! Thanks for taking a look.
diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
index ca6c84f4d93d..e712cc166157 100644
--- a/utils/mountd/cache.c
+++ b/utils/mountd/cache.c
@@ -727,6 +727,17 @@ static void nfsd_fh(int f)
 		for (exp = exportlist[i].p_head; exp; exp = next_exp) {
 			char *path;
 
+			if (!is_ipaddr_client(dom)
+					&& !namelist_client_matches(exp, dom)) {
+				next_exp = exp->m_next;
+				continue;
+			}
+			if (is_ipaddr_client(dom)
+					&& !ipaddr_client_matches(exp, ai)) {
+				next_exp = exp->m_next;
+				continue;
+			}
+
 			if (exp->m_export.e_flags & NFSEXP_CROSSMOUNT) {
 				static nfs_export *prev = NULL;
 				static void *mnt = NULL;
@@ -751,9 +762,6 @@ static void nfsd_fh(int f)
 				next_exp = exp->m_next;
 			}
 
-			if (!is_ipaddr_client(dom)
-					&& !namelist_client_matches(exp, dom))
-				continue;
 			if (exp->m_export.e_mountpoint &&
 			    !is_mountpoint(exp->m_export.e_mountpoint[0]?
 					   exp->m_export.e_mountpoint:
@@ -762,9 +770,6 @@
 			if (!match_fsid(&parsed, exp, path))
 				continue;
-			if (is_ipaddr_client(dom)
-					&& !ipaddr_client_matches(exp, ai))
-				continue;
 			if (!found || subexport(&exp->m_export, found)) {
 				found = &exp->m_export;
 				free(found_path);