Message ID | 1546812868-11794-34-git-send-email-jsimmons@infradead.org (mailing list archive)
---|---
State | New, archived
Series | lustre: add PFL support
On Sun, Jan 06 2019, James Simmons wrote:

> With the majority of missing patches and features from the lustre
> 2.10 release merged upstream its time to update the upstream
> client's version.

:-)

Thanks to some of these patches (this batch or previous) I have fewer
failing tests now .. though not many fewer.

The current summary is
  45 status: FAIL
 556 status: PASS
  47 status: SKIP

It used to be >50 FAIL.

The failing tests are listed below.
I know why the FID patches fail - we've discussed that.
Maybe it is time to start working out why some of the others are
failing.

Your two recent series are in my lustre-testing branch now - thanks.

NeilBrown


sanity: FAIL: test_27G 'testpool' is not empty
sanity: FAIL: test_56w /root/lustre-release/lustre/utils/lfs getstripe -c /mnt/lustre/d56w.sanity/file1 wrong: found 2, expected 1
sanity: FAIL: test_56x migrate failed rc = 11
sanity: FAIL: test_56xa migrate failed rc = 11
sanity: FAIL: test_56z /root/lustre-release/lustre/utils/lfs find did not continue after error
sanity: FAIL: test_56aa lfs find --size wrong under striped dir
sanity: FAIL: test_56ca create /mnt/lustre/d56ca.sanity/f56ca.sanity- failed
sanity: FAIL: test_64b oos.sh failed: 1
sanity: FAIL: test_102c setstripe failed
sanity: FAIL: test_102j file1-0-1: size != 65536
sanity: FAIL: test_103a misc test failed
sanity: FAIL: test_104b lfs check servers test failed
sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed
sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed
sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed
sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed
sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed
sanity: FAIL: test_160a changelog 'f160a.sanity' fid != file fid [0x240002342:0xd:0x0]
sanity: FAIL: test_161d cat failed
sanity: FAIL: test_205 No jobstats for id.205.mkdir.9480 found on mds1::*.lustre-MDT0000.job_stats
sanity: FAIL: test_208 get lease error
sanity: FAIL: test_225a mds-survey with zero-stripe failed
sanity: FAIL: test_225b mds-survey with stripe_count failed
sanity: FAIL: test_233a cannot access /mnt/lustre using its FID '[0x200000007:0x1:0x0]'
sanity: FAIL: test_233b cannot access /mnt/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'
sanity: FAIL: test_255c Ladvise test10 failed, 255
sanity: FAIL: test_270a Can't create DoM layout
sanity: FAIL: test_270c bad pattern
sanity: FAIL: test_270e lfs find -L: found 1, expected 20
sanity: FAIL: test_270f Can't create file with 262144 DoM stripe
sanity: FAIL: test_271c Too few enqueues , expected > 2000
sanity: FAIL: test_271f expect 1 READ RPC, occured
sanity: FAIL: test_300g create striped_dir failed
sanity: FAIL: test_300n create striped dir fails with gid=-1
sanity: FAIL: test_300q create d300q.sanity fails
sanity: FAIL: test_315 read is not accounted ()
sanity: FAIL: test_317 Expected Block 4096 got 10240 for f317.sanity
sanity: FAIL: test_405 One layout swap locked test failed
sanity: FAIL: test_406 mkdir d406.sanity failed
sanity: FAIL: test_409 Fail to cleanup the env!
sanity: FAIL: test_410 no inode match
sanity: FAIL: test_412 mkdir failed
sanity: FAIL: test_413 don't expect 1
sanity: FAIL: test_802 (5) Mount client with 'ro' should succeed
> On Sun, Jan 06 2019, James Simmons wrote:
>
> > With the majority of missing patches and features from the lustre
> > 2.10 release merged upstream its time to update the upstream
> > client's version.
>
> :-)
>
> Thanks to some of these patches (this batch or previous) I have fewer
> failing tests now .. those not many fewer.
>
> The current summary is
>   45 status: FAIL
>  556 status: PASS
>   47 status: SKIP
>
> It used to be >50 FAIL.
>
> The failing tests are listed below.
> I know why the FID patches fail - we've discussed that.
> Maybe it is time to start working out why some of the others are
> failing.

You are running a much newer test suite. Using the test suite from lustre
2.10 I see the following failures.

sanity: FAIL: test_103a run_acl_subtest cp failed (real failure)
sanity: FAIL: test_215 cannot read lnet.stats (not sysfs aware)
sanity: FAIL: test_233a cannot access /lustre/lustre using its FID '[0x200000007:0x1:0x0]'
sanity: FAIL: test_233b cannot access /lustre/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'
sanity: FAIL: test_256 Changelog catalog has wrong number of slots 0 (fails for 2.10 LTS release as well)

> Your two recent series are in my lustre-testing branch now - thanks.
>
> NeilBrown
>
>
> sanity: FAIL: test_27G 'testpool' is not empty

See LU-11208. Test currently with older lustre versions.

> sanity: FAIL: test_56w /root/lustre-release/lustre/utils/lfs getstripe -c /mnt/lustre/d56w.sanity/file1 wrong: found 2, expected 1
> sanity: FAIL: test_56x migrate failed rc = 11
> sanity: FAIL: test_56xa migrate failed rc = 11
> sanity: FAIL: test_56z /root/lustre-release/lustre/utils/lfs find did not continue after error
> sanity: FAIL: test_56aa lfs find --size wrong under striped dir
> sanity: FAIL: test_56ca create /mnt/lustre/d56ca.sanity/f56ca.sanity- failed
> sanity: FAIL: test_64b oos.sh failed: 1
> sanity: FAIL: test_102c setstripe failed
> sanity: FAIL: test_102j file1-0-1: size != 65536

I believe these are due to the missing DoM feature.

> sanity: FAIL: test_103a misc test failed

103a is a real failure. Never solved yet. (LU-11594 and LU-10334 for Ubuntu)

> sanity: FAIL: test_104b lfs check servers test failed

sysfs bug. I have a patch for this.

> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed
> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed
> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed
> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed
> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed

What version of e2fsprogs are you running? You need a 1.44 version and
this should go away.

> sanity: FAIL: test_160a changelog 'f160a.sanity' fid != file fid [0x240002342:0xd:0x0]
> sanity: FAIL: test_161d cat failed

Might be missing some more changelog improvements.

> sanity: FAIL: test_205 No jobstats for id.205.mkdir.9480 found on mds1::*.lustre-MDT0000.job_stats

Strange?

> sanity: FAIL: test_208 get lease error
> sanity: FAIL: test_225a mds-survey with zero-stripe failed
> sanity: FAIL: test_225b mds-survey with stripe_count failed

Never ran that since it's not in 2.10.

> sanity: FAIL: test_233a cannot access /mnt/lustre using its FID '[0x200000007:0x1:0x0]'
> sanity: FAIL: test_233b cannot access /mnt/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'
> sanity: FAIL: test_255c Ladvise test10 failed, 255
> sanity: FAIL: test_270a Can't create DoM layout
> sanity: FAIL: test_270c bad pattern
> sanity: FAIL: test_270e lfs find -L: found 1, expected 20
> sanity: FAIL: test_270f Can't create file with 262144 DoM stripe
> sanity: FAIL: test_271c Too few enqueues , expected > 2000
> sanity: FAIL: test_271f expect 1 READ RPC, occured
> sanity: FAIL: test_300g create striped_dir failed
> sanity: FAIL: test_300n create striped dir fails with gid=-1
> sanity: FAIL: test_300q create d300q.sanity fails
> sanity: FAIL: test_315 read is not accounted ()
> sanity: FAIL: test_317 Expected Block 4096 got 10240 for f317.sanity
> sanity: FAIL: test_405 One layout swap locked test failed
> sanity: FAIL: test_406 mkdir d406.sanity failed
> sanity: FAIL: test_409 Fail to cleanup the env!

More DoM issues? Could be FLR as well if you are running the latest
test suite.

> sanity: FAIL: test_410 no inode match

This is a weird test running a local kernel module.

> sanity: FAIL: test_412 mkdir failed
> sanity: FAIL: test_413 don't expect 1

More DoM ???? Have to look at this.

> sanity: FAIL: test_802 (5) Mount client with 'ro' should succeed

This test is broken. It assumes you have a specially patched kernel.
Details are under ticket LU-684.

The nice thing with the linux client is that we are at a point where
it wouldn't be a huge leap to integrate DoM (Data on MetaData).
The reason I suggested cleanups and moving out of staging first was
to preserve git blame a bit better with future patches. Currently
we see a lot of "0846e85ba2346 (NeilBrown 2018-06-07" with git blame.
On Jan 7, 2019, at 21:26, James Simmons <jsimmons@infradead.org> wrote: > >> >> On Sun, Jan 06 2019, James Simmons wrote: >> >>> With the majority of missing patches and features from the lustre >>> 2.10 release merged upstream its time to update the upstream >>> client's version. >> >> :-) >> >> Thanks to some of these patches (this batch or previous) I have fewer >> failing tests now .. those not many fewer. >> >> The current summary is >> 45 status: FAIL >> 556 status: PASS >> 47 status: SKIP >> >> It used to be >50 FAIL. >> >> The failing tests are listed below. >> I know why the FID patches fail - we've discussed that. >> Maybe it is time to start working out why some of the others are >> failing. > > You are running a much newer test suite. Using the test suite from lustre > 2.10 I see the following failures. > > sanity: FAIL: test_103a run_acl_subtest cp failed (real failure) > sanity: FAIL: test_215 cannot read lnet.stats (not sysfs aware) > sanity: FAIL: test_233a cannot access /lustre/lustre using its FID '[0x200000007:0x1:0x0]' > sanity: FAIL: test_233b cannot access /lustre/lustre/.lustre using its FID '[0x200000002:0x1:0x0]' > sanity: FAIL: test_256 Changelog catalog has wrong number of slots 0 (fails for 2.10 LTS release as well) Yes, there are definitely some tests that do not have proper client/server version/feature checks, since the tests are introduced with the code they are testing. There are a number of patches in Gerrit that are adding the proper checks that I'd like to get landed, because we do run client/server version interop testing, but they always lag a bit behind and we never see test-script/client version issues in our testing. >> Your two recent series are in my lustre-testing branch now - thanks. >> >> NeilBrown >> >> >> sanity: FAIL: test_27G 'testpool' is not empty > > See LU-11208. Test currently with older lustre versions. > >> sanity: FAIL: test_56w /root/lustre-release/lustre/utils/lfs getstripe -c /mnt/lustre/d56w.sanity/file1 wrong: found 2, expected 1 >> sanity: FAIL: test_56x migrate failed rc = 11 >> sanity: FAIL: test_56xa migrate failed rc = 11 >> sanity: FAIL: test_56z /root/lustre-release/lustre/utils/lfs find did not continue after error >> sanity: FAIL: test_56aa lfs find --size wrong under striped dir >> sanity: FAIL: test_56ca create /mnt/lustre/d56ca.sanity/f56ca.sanity- failed >> sanity: FAIL: test_64b oos.sh failed: 1 >> sanity: FAIL: test_102c setstripe failed >> sanity: FAIL: test_102j file1-0-1: size != 65536 > > I believe these are due to the DoM feature missing > >> sanity: FAIL: test_103a misc test failed > > 103a is real failure. Never solved yet. (LU-11594 and LU-10334 for Ubuntu) > >> sanity: FAIL: test_104b lfs check servers test failed > > sysfs bug. I have a patch for this. > >> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed >> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed >> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed >> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed >> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed > > What version of e2fsprog are you running? You need a 1.44 version and > this should go away. To be clear - the Lustre-patched "filefrag" at: https://downloads.whamcloud.com/public/e2fsprogs/1.44.3.wc1/ Once Lustre gets into upstream, or convince another filesystem to use the Lustre filefrag extension (multiple devices, which BtrFS and XFS could use) we can get the support landed into the upstream e2fsprogs. 
>> sanity: FAIL: test_160a changelog 'f160a.sanity' fid != file fid [0x240002342:0xd:0x0] >> sanity: FAIL: test_161d cat failed > > Might be missing some more changelog improvements. > >> sanity: FAIL: test_205 No jobstats for id.205.mkdir.9480 found on mds1::*.lustre-MDT0000.job_stats > > Strange? This might be because the upstream Lustre doesn't allow setting per-process JobID via environment variable, only as a single per-node value. The real unfortunate part is that the "get JobID from environment" actually works for every reasonable architecture (even the one which was originally broken fixed it), but it got yanked anyway. This is actually one of the features of Lustre that lots of HPC sites like to use, since it allows them to track on the servers which users/jobs/processes on the client are doing IO. >> sanity: FAIL: test_208 get lease error >> sanity: FAIL: test_225a mds-survey with zero-stripe failed >> sanity: FAIL: test_225b mds-survey with stripe_count failed > > Never ran that since its not in 2.10. > >> sanity: FAIL: test_233a cannot access /mnt/lustre using its FID '[0x200000007:0x1:0x0]' >> sanity: FAIL: test_233b cannot access /mnt/lustre/.lustre using its FID '[0x200000002:0x1:0x0]' > >> sanity: FAIL: test_255c Ladvise test10 failed, 255 >> sanity: FAIL: test_270a Can't create DoM layout >> sanity: FAIL: test_270c bad pattern >> sanity: FAIL: test_270e lfs find -L: found 1, expected 20 >> sanity: FAIL: test_270f Can't create file with 262144 DoM stripe >> sanity: FAIL: test_271c Too few enqueues , expected > 2000 >> sanity: FAIL: test_271f expect 1 READ RPC, occured >> sanity: FAIL: test_300g create striped_dir failed >> sanity: FAIL: test_300n create striped dir fails with gid=-1 >> sanity: FAIL: test_300q create d300q.sanity fails >> sanity: FAIL: test_315 read is not accounted () >> sanity: FAIL: test_317 Expected Block 4096 got 10240 for f317.sanity >> sanity: FAIL: test_405 One layout swap locked test failed >> sanity: FAIL: test_406 mkdir d406.sanity failed >> sanity: FAIL: test_409 Fail to cleanup the env! > > More DoM issues? Could be FLR as well if you are running the latest > test suite. > >> sanity: FAIL: test_410 no inode match > > This is a weird test running a local kernel module. > >> sanity: FAIL: test_412 mkdir failed >> sanity: FAIL: test_413 don't expect 1 > > More DoM ???? Have to look at this. > >> sanity: FAIL: test_802 (5) Mount client with 'ro' should succeed > > Is test is broken. It assumes you have a specially patched kernel. > Details are under ticket LU-684. > > The nice thing is with the linux client is that we are at a point > it wouldn't be a huge leap to integrate DoM (Data on MetaData). > The reason I suggest cleanups and moving out of staging first was > to perserve git blame a bit better with future patches. Currently > we see a lot of "0846e85ba2346 (NeilBrown 2018-06-07" with git blame. Cheers, Andreas --- Andreas Dilger CTO Whamcloud
> >> sanity: FAIL: test_104b lfs check servers test failed
> >
> > sysfs bug. I have a patch for this.
> >
> >> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed
> >> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed
> >> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed
> >> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed
> >> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed
> >
> > What version of e2fsprog are you running? You need a 1.44 version and
> > this should go away.
>
> To be clear - the Lustre-patched "filefrag" at:
>
> https://downloads.whamcloud.com/public/e2fsprogs/1.44.3.wc1/
>
> Once Lustre gets into upstream, or convince another filesystem to use the
> Lustre filefrag extension (multiple devices, which BtrFS and XFS could
> use) we can get the support landed into the upstream e2fsprogs.

I could have sworn that Ubuntu 18 testing passed with the default e2fsprogs
(1.44.4). To let Neil know, this is why lustre_fiemap.h exists in the uapi
headers directory. This kind of functionality would help the community at
large.

> >> sanity: FAIL: test_160a changelog 'f160a.sanity' fid != file fid [0x240002342:0xd:0x0]
> >> sanity: FAIL: test_161d cat failed
> >
> > Might be missing some more changelog improvements.
> >
> >> sanity: FAIL: test_205 No jobstats for id.205.mkdir.9480 found on mds1::*.lustre-MDT0000.job_stats
> >
> > Strange?
>
> This might be because the upstream Lustre doesn't allow setting per-process
> JobID via environment variable, only as a single per-node value. The real
> unfortunate part is that the "get JobID from environment" actually works for
> every reasonable architecture (even the one which was originally broken
> fixed it), but it got yanked anyway. This is actually one of the features
> of Lustre that lots of HPC sites like to use, since it allows them to track
> on the servers which users/jobs/processes on the client are doing IO.

To give background for Neil see thread:

https://lore.kernel.org/patchwork/patch/416846

In this case I do agree with Greg. The latest jobid does implement an
upcall and upcalls don't play nice with containers. There is also the
namespace issue pointed out. I think the namespace issue might be fixed
in the latest OpenSFS code.

The whole approach to stats in lustre is pretty awful. Take jobstats for
example. Currently the approach is to poll inside the kernel at specific
intervals. Part of the polling is scanning the running processes'
environment space. On top of this the administrator ends up creating
scripts to poll the proc / debugfs entry. Other types of lustre stat files
take a similar approach. Scripts have to poll debugfs / procfs entries.

I have been thinking about what would be a better approach, since I'd like
to tackle this problem in the 2.13 time frame. Our admins at my work place
want to be able to collect application stats without being root. So placing
stats in debugfs is not an option, which we currently do for the linux
client :-( The stats are not a good fit for sysfs. The solution I have been
pondering is using netlink. Since netlink is socket based it can be treated
as a pipe. Now you are thinking "well, you still need to poll on the
netlink socket", but you don't have to. systemd does it for you :-) We can
create a systemd service file which uses ListenNetlink=generic
"multicast group" ... to launch a service to collect the stats.

As for job stats we can use another type of netlink class called the
process connector. The link below covers this little known feature. It is
available on every system in the support matrix.

https://www.slideshare.net/kerneltlv/kernel-proc-connector-and-containers

In this case we parse for the job scheduler IDs and pass them to lustre
using systemd.
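[For reference, here is a minimal userspace sketch of the process connector interface mentioned above. This is not Lustre code, only an illustration of the netlink class James refers to: it subscribes to fork/exec events (CAP_NET_ADMIN is required), which is the point where a helper service could look up the batch scheduler's job ID for each new pid and hand it on.]

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

int main(void)
{
	struct sockaddr_nl sa = {
		.nl_family = AF_NETLINK,
		.nl_groups = CN_IDX_PROC,
		.nl_pid    = getpid(),
	};
	struct {
		struct nlmsghdr nlh;
		struct cn_msg cn;
		enum proc_cn_mcast_op op;
	} req;
	char buf[1024] __attribute__((aligned(NLMSG_ALIGNTO)));
	int sock;

	sock = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
	if (sock < 0 || bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		return 1;

	/* subscribe to process events (needs CAP_NET_ADMIN) */
	memset(&req, 0, sizeof(req));
	req.nlh.nlmsg_len = sizeof(req);
	req.nlh.nlmsg_type = NLMSG_DONE;
	req.nlh.nlmsg_pid = getpid();
	req.cn.id.idx = CN_IDX_PROC;
	req.cn.id.val = CN_VAL_PROC;
	req.cn.len = sizeof(req.op);
	req.op = PROC_CN_MCAST_LISTEN;
	if (send(sock, &req, sizeof(req), 0) < 0)
		return 1;

	for (;;) {
		struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
		struct cn_msg *cn;
		struct proc_event *ev;

		if (recv(sock, buf, sizeof(buf), 0) <= 0)
			break;
		cn = NLMSG_DATA(nlh);
		ev = (struct proc_event *)cn->data;
		/* here a helper could read the new task's scheduler job ID
		 * and pass it to lustre */
		if (ev->what == PROC_EVENT_EXEC)
			printf("exec: pid %d\n",
			       ev->event_data.exec.process_pid);
	}
	return 0;
}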
On Jan 8, 2019, at 14:47, James Simmons <jsimmons@infradead.org> wrote: > >>>> sanity: FAIL: test_104b lfs check servers test failed >>> >>> sysfs bug. I have a patch for this. >>> >>>> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed >>>> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed >>>> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed >>>> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed >>>> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed >>> >>> What version of e2fsprog are you running? You need a 1.44 version and >>> this should go away. >> >> To be clear - the Lustre-patched "filefrag" at: >> >> https://downloads.whamcloud.com/public/e2fsprogs/1.44.3.wc1/ >> >> Once Lustre gets into upstream, or convince another filesystem to use the >> Lustre filefrag extension (multiple devices, which BtrFS and XFS could >> use) we can get the support landed into the upstream e2fsprogs. > > I swore that Ubuntu18 testing passed with the default e2fsprogs (1.44.4). > To let Neil know, this is why lustre_fiemap.h exist in the uapi headers > directory. This kind of functionality would help the community at large. The returned data is identical for single-striped files, so the vanilla filefrag will work on Lustre in the common case. >>>> sanity: FAIL: test_160a changelog 'f160a.sanity' fid != file fid [0x240002342:0xd:0x0] >>>> sanity: FAIL: test_161d cat failed >>> >>> Might be missing some more changelog improvements. >>> >>>> sanity: FAIL: test_205 No jobstats for id.205.mkdir.9480 found on mds1::*.lustre-MDT0000.job_stats >>> >>> Strange? >> >> This might be because the upstream Lustre doesn't allow setting per-process >> JobID via environment variable, only as a single per-node value. The real >> unfortunate part is that the "get JobID from environment" actually works for >> every reasonable architecture (even the one which was originally broken >> fixed it), but it got yanked anyway. This is actually one of the features >> of Lustre that lots of HPC sites like to use, since it allows them to track >> on the servers which users/jobs/processes on the client are doing IO. > > To give background for Neil see thread: > > https://lore.kernel.org/patchwork/patch/416846 > > In this case I do agree with Greg. The latest jobid does implement an > upcall and upcalls don't play niece with containers. Their is also the > namespace issue pointed out. I think the namespace issue might be fixed > in the latest OpenSFS code. I'm not sure what you mean? AFAIK, there is no upcall for JobID, except maybe in the kernel client where we weren't allowed to parse the process environment directly. I agree an upcall is problematic with namespaces, in addition to being less functional (only a JobID per node instead of per process), which is why direct access to JOBENV is better IMHO. > The whole approach to stats in lustre is > pretty awful. Take jobstats for example. Currently the approach is > to poll inside the kernel at specific intervals. Part of the polling is > scanning the running processes environment space. On top of this the > administor ends up creating scripts to poll the proc / debugfs entry. > Other types of lustre stat files take a similar approach. Scripts have > to poll debugfs / procfs entries. I think that issue is orthogonal to getting the actual JobID. That is the stats collection from the kernel. We shouldn't be inventing a new way to process that. What does "top" do? 
Read a thousand /proc files every second because that is flexible for different use cases. There are much fewer Lustre stats files on a given node, and I haven't heard that the actual stats reading interface is a performance issue. > I have been thinking what would be a better approach since I like to > approach this problem for the 2.13 time frame. Our admins at my work > place want to be able to collect application stats without being root. > So placing stats in debugfs is not an option, which we currently do > the linux client :-( The stats are not a good fit for sysfs. The solution > I have been pondering is using netlink. Since netlink is socket based it > can be treated as a pipe. Now you are thinking well you still need to poll > on the netlink socket but you don't have too. systemd does it for you :-) > We can create systemd service file which uses For the love of all that is holy, do not make Lustre stats usage depend on Systemd to be usable. Cheers, Andreas --- Andreas Dilger Principal Lustre Architect Whamcloud
> >> This might be because the upstream Lustre doesn't allow setting per-process
> >> JobID via environment variable, only as a single per-node value. The real
> >> unfortunate part is that the "get JobID from environment" actually works for
> >> every reasonable architecture (even the one which was originally broken
> >> fixed it), but it got yanked anyway. This is actually one of the features
> >> of Lustre that lots of HPC sites like to use, since it allows them to track
> >> on the servers which users/jobs/processes on the client are doing IO.
> >
> > To give background for Neil see thread:
> >
> > https://lore.kernel.org/patchwork/patch/416846
> >
> > In this case I do agree with Greg. The latest jobid does implement an
> > upcall and upcalls don't play niece with containers. Their is also the
> > namespace issue pointed out. I think the namespace issue might be fixed
> > in the latest OpenSFS code.
>
> I'm not sure what you mean? AFAIK, there is no upcall for JobID, except
> maybe in the kernel client where we weren't allowed to parse the process
> environment directly. I agree an upcall is problematic with namespaces,
> in addition to being less functional (only a JobID per node instead of
> per process), which is why direct access to JOBENV is better IMHO.

I have some evil ideas about this. Need to think about it some more since
this is a more complex problem.

> > The whole approach to stats in lustre is
> > pretty awful. Take jobstats for example. Currently the approach is
> > to poll inside the kernel at specific intervals. Part of the polling is
> > scanning the running processes environment space. On top of this the
> > administor ends up creating scripts to poll the proc / debugfs entry.
> > Other types of lustre stat files take a similar approach. Scripts have
> > to poll debugfs / procfs entries.
>
> I think that issue is orthogonal to getting the actual JobID. That is
> the stats collection from the kernel. We shouldn't be inventing a new
> way to process that. What does "top" do? Read a thousand /proc files
> every second because that is flexible for different use cases. There
> are much fewer Lustre stats files on a given node, and I haven't heard
> that the actual stats reading interface is a performance issue.

Because the policy for the linux kernel is not to add non-process-related
information to procfs anymore. "top" reads process information from procfs,
which is okay. This means the stats lustre generates are required to be
placed in debugfs. The problem there is you need to be root to access this
information. I told the administrator about this and they told me in no
way will they run an application as root just to read stats. We really
don't want to require users to mount their debugfs partitions to allow
non-root users to access it. So I looked into alternatives. Actually with
netlink you have far more power for handling stats than polling some proc
file. Also, while in most cases the stat files are not huge, if we do end
up having a stat seq_file with a huge amount of data then polling that
file can really spike the load on a node.

> > I have been thinking what would be a better approach since I like to
> > approach this problem for the 2.13 time frame. Our admins at my work
> > place want to be able to collect application stats without being root.
> > So placing stats in debugfs is not an option, which we currently do
> > the linux client :-( The stats are not a good fit for sysfs. The solution
> > I have been pondering is using netlink. Since netlink is socket based it
> > can be treated as a pipe. Now you are thinking well you still need to poll
> > on the netlink socket but you don't have too. systemd does it for you :-)
> > We can create systemd service file which uses
>
> For the love of all that is holy, do not make Lustre stats usage depend
> on Systemd to be usable.

I never write code that locks in one approach ever. Take for example the
lctl conf_param / set_param -P handling with the move to sysfs. Instead
of the old upcall method to lctl, now we have a udev rule. That rule is not
law!!! A site could create their own udev rule if they want to, say, log
changes to the lustre tunables. Keep in mind udev rules need to be simple
since they block until completed, much like upcalls do. If you want to
run a heavy application you can create a systemd service to handle the
tunable uevent. If you are really clever you can use dbus to send the
tuning event to external nodes. There are many creative options.

The same is true with the stats netlink approach. I was up late last night
pondering a design for the netlink stats. I have to put together a list
of my ideas and run it by my admins. So no, systemd is not a hard
requirement. Just an option for people into that sort of thing. Using
udev and netlink opens up a whole new stack to take advantage of.
On Jan 9, 2019, at 11:28, James Simmons <jsimmons@infradead.org> wrote: > > >>>> This might be because the upstream Lustre doesn't allow setting per-process >>>> JobID via environment variable, only as a single per-node value. The real >>>> unfortunate part is that the "get JobID from environment" actually works for >>>> every reasonable architecture (even the one which was originally broken >>>> fixed it), but it got yanked anyway. This is actually one of the features >>>> of Lustre that lots of HPC sites like to use, since it allows them to track >>>> on the servers which users/jobs/processes on the client are doing IO. >>> >>> To give background for Neil see thread: >>> >>> https://lore.kernel.org/patchwork/patch/416846 >>> >>> In this case I do agree with Greg. The latest jobid does implement an >>> upcall and upcalls don't play niece with containers. Their is also the >>> namespace issue pointed out. I think the namespace issue might be fixed >>> in the latest OpenSFS code. >> >> I'm not sure what you mean? AFAIK, there is no upcall for JobID, except >> maybe in the kernel client where we weren't allowed to parse the process >> environment directly. I agree an upcall is problematic with namespaces, >> in addition to being less functional (only a JobID per node instead of >> per process), which is why direct access to JOBENV is better IMHO. > > I have some evil ideas about this. Need to think about it some more since > this is a more complex problem. Since the kernel manages the environment variables via getenv() and setenv(), I honestly don't see why accessing them directly is a huge issue? >>> The whole approach to stats in lustre is >>> pretty awful. Take jobstats for example. Currently the approach is >>> to poll inside the kernel at specific intervals. Part of the polling is >>> scanning the running processes environment space. On top of this the >>> administor ends up creating scripts to poll the proc / debugfs entry. >>> Other types of lustre stat files take a similar approach. Scripts have >>> to poll debugfs / procfs entries. >> >> I think that issue is orthogonal to getting the actual JobID. That is >> the stats collection from the kernel. We shouldn't be inventing a new >> way to process that. What does "top" do? Read a thousand /proc files >> every second because that is flexible for different use cases. There >> are much fewer Lustre stats files on a given node, and I haven't heard >> that the actual stats reading interface is a performance issue. > > Because the policy for the linux kernel is not to add non processes > related information in procfs anymore. "top" reads process information > from procfs which is okay. The location of the stats (procfs vs. sysfs vs. debugfs) wasn't my point. My point was that a *very* core kernel performance monitoring utility is doing open/read/close from virtual kernel files, so before we go ahead and invent our own performance monitoring framework (which may be frowned upon by upstream for arbitrary reasons because it isn't using /proc or /sys files). > This means the stats lustre generates are > required to be placed in debugfs. The problem their is you need to be > root to access this information. That is a self-inflicted problem because of upstream kernel policy to move the existing files out of /proc and being unable to use /sys either. > I told the administrator about this > and they told me in no way will they run an application as root just > to read stats. 
We really don't want to require users to mount their > debugfs partitions to allow non root uses to access it. So I looked > into alteranatives. Actually with netlink you have far more power for > handling stats then polling some proc file. Also while for most cases the > stat files are not huge in general but if we do end up having a stat > seq_file with a huge amount of data then polling that file can really > spike the load on an node. I agree it is not ideal. One option (AFAIK) would be a udev rule that changes the /sys/kernel/debug/lustre/* files to be at readable by a non-root group (e.g. admin or perftools or whatever) for the collector. >>> I have been thinking what would be a better approach since I like to >>> approach this problem for the 2.13 time frame. Our admins at my work >>> place want to be able to collect application stats without being root. >>> So placing stats in debugfs is not an option, which we currently do >>> the linux client :-( The stats are not a good fit for sysfs. The solution >>> I have been pondering is using netlink. Since netlink is socket based it >>> can be treated as a pipe. Now you are thinking well you still need to poll >>> on the netlink socket but you don't have too. systemd does it for you :-) >>> We can create systemd service file which uses >> >> For the love of all that is holy, do not make Lustre stats usage depend >> on Systemd to be usable. > > I never write code that locks in one approach ever. Take for example the > lctl conf_param / set_param -P handling with the move to sysfs. Instead > of the old upcall method to lctl now we have a udev rule. That rule is not > law!!! A site could create their own udev rule if they want to say log > changes to the lustre tunables. Keep in mind udev rules need to be simple > since they block until completed much like upcalls do. If you want to > run a heavy application you can create a system.d service to handle the > tunable uevent. If you are really clever you can use dbus to send that > the tuning event to external nodes. Their are many creative options. > > The same is true with the stats netlink approach. I was up late last night > pondering a design for the netlink stats. I have to put together a list > of my ideas and run it by my admins. So no systemd is not a hard > requirment. Just an option for people into that sort of thing. Using > udev and netlink opens up a whole new stack to take advantage of. Sorry to be negative, but I was just having fun with systemd over the weekend on one of my home systems, and I really don't want to entangle it into our stats. If the existing procfs/sysfs/debugfs scraping will continue to work in the future then I'm fine with that. Cheers, Andreas --- Andreas Dilger Principal Lustre Architect Whamcloud
On Tue, Jan 08 2019, Andreas Dilger wrote: > On Jan 7, 2019, at 21:26, James Simmons <jsimmons@infradead.org> wrote: >> >>> >>> On Sun, Jan 06 2019, James Simmons wrote: >>> >>>> With the majority of missing patches and features from the lustre >>>> 2.10 release merged upstream its time to update the upstream >>>> client's version. >>> >>> :-) >>> >>> Thanks to some of these patches (this batch or previous) I have fewer >>> failing tests now .. those not many fewer. >>> >>> The current summary is >>> 45 status: FAIL >>> 556 status: PASS >>> 47 status: SKIP >>> >>> It used to be >50 FAIL. >>> >>> The failing tests are listed below. >>> I know why the FID patches fail - we've discussed that. >>> Maybe it is time to start working out why some of the others are >>> failing. >> >> You are running a much newer test suite. Using the test suite from lustre >> 2.10 I see the following failures. >> >> sanity: FAIL: test_103a run_acl_subtest cp failed (real failure) >> sanity: FAIL: test_215 cannot read lnet.stats (not sysfs aware) >> sanity: FAIL: test_233a cannot access /lustre/lustre using its FID '[0x200000007:0x1:0x0]' >> sanity: FAIL: test_233b cannot access /lustre/lustre/.lustre using its FID '[0x200000002:0x1:0x0]' >> sanity: FAIL: test_256 Changelog catalog has wrong number of slots 0 (fails for 2.10 LTS release as well) > > Yes, there are definitely some tests that do not have proper client/server version/feature checks, since the tests are introduced with the code they > are testing. There are a number of patches in Gerrit that are adding the > proper checks that I'd like to get landed, because we do run client/server > version interop testing, but they always lag a bit behind and we never see > test-script/client version issues in our testing. > >>> Your two recent series are in my lustre-testing branch now - thanks. >>> >>> NeilBrown >>> >>> >>> sanity: FAIL: test_27G 'testpool' is not empty >> >> See LU-11208. Test currently with older lustre versions. >> >>> sanity: FAIL: test_56w /root/lustre-release/lustre/utils/lfs getstripe -c /mnt/lustre/d56w.sanity/file1 wrong: found 2, expected 1 >>> sanity: FAIL: test_56x migrate failed rc = 11 >>> sanity: FAIL: test_56xa migrate failed rc = 11 >>> sanity: FAIL: test_56z /root/lustre-release/lustre/utils/lfs find did not continue after error >>> sanity: FAIL: test_56aa lfs find --size wrong under striped dir >>> sanity: FAIL: test_56ca create /mnt/lustre/d56ca.sanity/f56ca.sanity- failed >>> sanity: FAIL: test_64b oos.sh failed: 1 >>> sanity: FAIL: test_102c setstripe failed >>> sanity: FAIL: test_102j file1-0-1: size != 65536 >> >> I believe these are due to the DoM feature missing >> >>> sanity: FAIL: test_103a misc test failed >> >> 103a is real failure. Never solved yet. (LU-11594 and LU-10334 for Ubuntu) >> >>> sanity: FAIL: test_104b lfs check servers test failed >> >> sysfs bug. I have a patch for this. >> >>> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed >>> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed >>> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed >>> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed >>> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed >> >> What version of e2fsprog are you running? You need a 1.44 version and >> this should go away. 
>
> To be clear - the Lustre-patched "filefrag" at:
>
> https://downloads.whamcloud.com/public/e2fsprogs/1.44.3.wc1/
>

I looked at Commit 41aee4226789 ("filefrag: Lustre changes to filefrag
FIEMAP handling") in the git tree instead.

This appears to add 3 features.

- It adds an optional device to struct fiemap.
  Presumably this is always returned if available, else zero is provided
  which means "the device".
- It adds a flag FIEMAP_EXTENT_NET which indicates that the device
  number is *not* a dev_t, but is some fs-specific value.
- It allows FIEMAP_FLAG_DEVICE_ORDER to be requested. I can't quite
  work out what this does. Presumably it changes the order that entries
  are returned (why?) and maybe returns multiple entries for a region
  that is mirrored ???

As you say, I can see how these might be useful to other filesystems.
Maybe we should try upstreaming the support sooner rather than later.

NeilBrown
On Wed, Jan 09 2019, Andreas Dilger wrote:

> On Jan 9, 2019, at 11:28, James Simmons <jsimmons@infradead.org> wrote:
>>
>>>>> This might be because the upstream Lustre doesn't allow setting per-process
>>>>> JobID via environment variable, only as a single per-node value. The real
>>>>> unfortunate part is that the "get JobID from environment" actually works for
>>>>> every reasonable architecture (even the one which was originally broken
>>>>> fixed it), but it got yanked anyway. This is actually one of the features
>>>>> of Lustre that lots of HPC sites like to use, since it allows them to track
>>>>> on the servers which users/jobs/processes on the client are doing IO.
>>>>
>>>> To give background for Neil see thread:
>>>>
>>>> https://lore.kernel.org/patchwork/patch/416846
>>>>
>>>> In this case I do agree with Greg. The latest jobid does implement an
>>>> upcall and upcalls don't play niece with containers. Their is also the
>>>> namespace issue pointed out. I think the namespace issue might be fixed
>>>> in the latest OpenSFS code.
>>>
>>> I'm not sure what you mean? AFAIK, there is no upcall for JobID, except
>>> maybe in the kernel client where we weren't allowed to parse the process
>>> environment directly. I agree an upcall is problematic with namespaces,
>>> in addition to being less functional (only a JobID per node instead of
>>> per process), which is why direct access to JOBENV is better IMHO.
>>
>> I have some evil ideas about this. Need to think about it some more since
>> this is a more complex problem.
>
> Since the kernel manages the environment variables via getenv() and setenv(),
> I honestly don't see why accessing them directly is a huge issue?

This is, at best, an over-simplification. The kernel doesn't "manage" the
environment variables.
When a process calls execve() (or similar) a collection of strings called
"arguments" and another collection of strings called "environment" are
extracted from the process's vm, and used for initializing part of the
newly created vm. That is all the kernel does with either.
(except for providing /proc/*/cmdline and /proc/*/environ, which is
best-effort).

getenv() and setenv() are entirely implemented in user-space. It is quite
possible for a process to mess up its args or environment in a way that
will make /proc/*/{cmdline,environ} fail to return anything useful.
It is quite possible for the memory storing args and env to be swapped
out. If a driver tried to access either, it might trigger page-in of
that part of the address space, which would probably work but might not
be a good idea.

As I understand it, the goal here is to have a cluster-wide identifier
that can be attached to groups of processes on different nodes. Then
stats relating to all of those processes can be collected together.

If I didn't think that control-groups were an abomination I would
probably suggest using them to define a group of processes, then to
attach a tag to that group. Both the net_cls and net_prio cgroups do
exactly this. perf_event does as well, and even uses the tag exactly
for collecting performance-data together for a set of processes.
Maybe we could try to champion an fs_event control group?
However it is a long time since I've looked at control groups, and they
might have moved on a bit.
But as I do think that control-groups are an abomination, I couldn't
possibly suggest any such thing.
Unix already has a perfectly good grouping abstraction - process groups (unfortunately there are about 3 sorts of these, but that needn't be a big problem). Stats can be collected based on pgid, and a mapping from client+pgid->jobid can be communicated to whatever collects the statistics ... somehow. NeilBrown
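[A rough sketch of the client+pgid -> jobid mapping suggested above, assuming a SLURM-style job environment variable; the variable name and output format are only illustrative. Something like this could run from a batch-job prolog and hand the mapping to whatever aggregates the per-pgid stats.]

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	char host[HOST_NAME_MAX + 1];
	const char *jobid = getenv("SLURM_JOB_ID");	/* example variable */
	pid_t pgid = getpgid(0);			/* the job's process group */

	if (gethostname(host, sizeof(host)) != 0 || !jobid)
		return 1;

	/* e.g. "client12:4711 873294" - consumed by the stats collector */
	printf("%s:%d %s\n", host, (int)pgid, jobid);
	return 0;
}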
On Tue, Jan 08 2019, James Simmons wrote: > > I have been thinking what would be a better approach since I like to > approach this problem for the 2.13 time frame. Our admins at my work > place want to be able to collect application stats without being root. > So placing stats in debugfs is not an option, which we currently do > the linux client :-( The stats are not a good fit for sysfs. How much statistics data are we talking about here? /proc/self/mountstats shows over 2K of stats for NFS filesystems. Is this in the ball-park or do you need an order-of-magnitude more? NeilBrown
On Jan 9, 2019, at 17:40, NeilBrown <neilb@suse.com> wrote:
>
> On Tue, Jan 08 2019, Andreas Dilger wrote:
>> On Jan 7, 2019, at 21:26, James Simmons <jsimmons@infradead.org> wrote:
>>>
>>>> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed
>>>> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed
>>>> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed
>>>> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed
>>>> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed
>>>
>>> What version of e2fsprog are you running? You need a 1.44 version and
>>> this should go away.
>>
>> To be clear - the Lustre-patched "filefrag" at:
>>
>> https://downloads.whamcloud.com/public/e2fsprogs/1.44.3.wc1/
>
> I looked at Commit 41aee4226789 ("filefrag: Lustre changes to filefrag
> FIEMAP handling") in the git tree instead.
>
> This appears to add 3 features.
>
> - It adds an optional device to struct fiemap.
>   Presumably this is always returned if available, else zero is provided
>   which means "the device".

Vanilla filefrag just returns 0 today. For Lustre filefrag it returns the
OST index on which the blocks are located. For local filesystems I'm
expecting it to return the rdev of the block device, like 0x801 or similar.

> - It adds a flag FIEMAP_EXTENT_NET which indicates that the device
>   number is *not* dev_t, but is some fs-specific value

Right.

> - It allows FIEMAP_FLAG_DEVICE_ORDER to be requested. I can't quite
>   work out what this does.

The logic makes sense once you understand it. Consider a striped Lustre
file, or perhaps an MD RAID device. If you returned the blocks in
file-logical order (i.e. block 0...EOF), then the largest extent that could
be returned for the same device would be stripe_size/chunk_size. This would
be very verbose (e.g. a 1TB file with 1MB stripe_size would be 1M lines of
output, though still better than the 256M lines from the old FIBMAP-based
filefrag). This would make it very hard to see if the file allocation is
contiguous or fragmented, which was our original goal for implementing
FIEMAP.

The DEVICE_ORDER flag means "return blocks in the underlying device order".
This allows returning block extents of the maximum size for the underlying
filesystem (128MB for ext4), and much more clearly shows whether the
underlying file allocation is contiguous or fragmented. It also simplifies
the implementation at the Lustre side, because we are essentially doing a
series of per-OST FIEMAP calls until the OST object is done, then moving on
to the next object in the file. The alternative (which I've thought of
implementing, just for compatibility reasons) would be to interleave the
FIEMAP output from each OST by the logical file offset, but it would be
ugly and not very useful, except for tools that want to see if a file has
holes or not.

$ filefrag -v /myth/tmp/4stripe
Filesystem type is: bd00bd0
File size of /myth/tmp/4stripe is 104857600 (102400 blocks of 1024 bytes)
 ext: device_logical:        physical_offset: length: dev:  flags:
   0:      0..  28671: 1837711360..1837740031:  28672: 0004: net
   1:      0..  24575: 1280876544..1280901119:  24576: 0000: net
   2:      0..  24575: 1535643648..1535668223:  24576: 0001: net
   3:      0..  24575: 4882608128..4882632703:  24576: 0003: last,net

> Presumably it changes the order that entries are returned (why?) and
> maybe returns multiple entries for a region that is mirrored ???

The multiple entries per region is needed for mirrored files.

> As you say, I can see how these might be useful to other filesystems.
> Maybe we should try upstreaming the support sooner rather than later.

I've tried a few times, but have been rebuffed because Lustre isn't in the
mainline. Originally, BtrFS wasn't going to have multiple device support,
but that has changed since the time FIEMAP was introduced. I'd of course be
happy if it was in mainline, or at least the fields in struct fiemap_extent
reserved to avoid future conflicts.

There was also a proposal from SuSE for BtrFS to add support for compressed
extents, but it never quite made it over the finish line:
David Sterba, "fiemap: introduce EXTENT_DATA_COMPRESSED flag"

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
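[A sketch of how a FIEMAP caller might use the extension discussed above. FIEMAP_FLAG_DEVICE_ORDER, FIEMAP_EXTENT_NET and the use of a reserved extent field for the device/OST index come from the Lustre-patched tools, not the upstream uapi, so the flag values and field choice below are assumptions.]

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#ifndef FIEMAP_FLAG_DEVICE_ORDER
#define FIEMAP_FLAG_DEVICE_ORDER 0x40000000	/* assumed value */
#endif
#ifndef FIEMAP_EXTENT_NET
#define FIEMAP_EXTENT_NET	 0x80000000	/* assumed value */
#endif

int main(int argc, char **argv)
{
	unsigned int i, count = 64;
	struct fiemap *fm = calloc(1, sizeof(*fm) +
				   count * sizeof(struct fiemap_extent));
	int fd;

	if (argc < 2 || !fm || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;

	fm->fm_start = 0;
	fm->fm_length = ~0ULL;			 /* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_DEVICE_ORDER; /* per-device extent order;
						  * filesystems without the
						  * extension reject this */
	fm->fm_extent_count = count;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		return 1;

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("ext %u: logical %llu len %llu dev 0x%x%s\n", i,
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_length,
		       fe->fe_reserved[0],	/* OST index if EXTENT_NET */
		       (fe->fe_flags & FIEMAP_EXTENT_NET) ? " (net)" : "");
	}
	free(fm);
	close(fd);
	return 0;
}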
On Jan 9, 2019, at 18:46, NeilBrown <neilb@suse.com> wrote:
>
> On Tue, Jan 08 2019, James Simmons wrote:
>>
>> I have been thinking what would be a better approach since I like to
>> approach this problem for the 2.13 time frame. Our admins at my work
>> place want to be able to collect application stats without being root.
>> So placing stats in debugfs is not an option, which we currently do
>> the linux client :-( The stats are not a good fit for sysfs.
>
> How much statistics data are we talking about here?
> /proc/self/mountstats
> shows over 2K of stats for NFS filesystems.
> Is this in the ball-park or do you need an order-of-magnitude more?

Ah, the joys of being grandfathered into the code...

One of the larger normal /proc files is the "obdfilter.*.brw_stats" file,
which I agree is a bit of an abomination of ASCII formatted output. Most
of the regular stats files are about 1KB in size, like:

wc /proc/fs/lustre//osc/myth-OST000*/stats
 14  107 1028 /proc/fs/lustre//osc/myth-OST0000-osc-ffff880429ee7c00/stats
 14  107 1011 /proc/fs/lustre//osc/myth-OST0001-osc-ffff880429ee7c00/stats
 13   99  989 /proc/fs/lustre//osc/myth-OST0002-osc-ffff880429ee7c00/stats
 14  107 1043 /proc/fs/lustre//osc/myth-OST0003-osc-ffff880429ee7c00/stats
 14  107 1075 /proc/fs/lustre//osc/myth-OST0004-osc-ffff880429ee7c00/stats

The {obdfilter,mdt}.*.job_stats files can become quite big on a large
system if there are lots of jobs running. James would have to report on
what kind of sizes they get on their nearly-largest-in-the-world
filesystem. Definitely into the MB range.

I don't think there would be a need to poll that super frequently, maybe
in the 60s range, as it keeps the stats for some minutes after a job stops
IO, or it would be impossible to gather accurate stats for the whole job.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
On Jan 9, 2019, at 18:36, NeilBrown <neilb@suse.com> wrote: > > On Wed, Jan 09 2019, Andreas Dilger wrote: > >> On Jan 9, 2019, at 11:28, James Simmons <jsimmons@infradead.org> wrote: >>> >>>>>> This might be because the upstream Lustre doesn't allow setting >>>>>> per-process JobID via environment variable, only as a single >>>>>> per-node value. The real unfortunate part is that the "get JobID >>>>>> from environment" actually works for every reasonable architecture >>>>>> (even the one which was originally broken fixed it), but it got >>>>>> yanked anyway. This is actually one of the features of Lustre that >>>>>> lots of HPC sites like to use, since it allows them to track on the >>>>>> servers which users/jobs/processes on the client are doing IO. >>>>> >>>>> To give background for Neil see thread: >>>>> >>>>> https://lore.kernel.org/patchwork/patch/416846 >>>>> >>>>> In this case I do agree with Greg. The latest jobid does implement an >>>>> upcall and upcalls don't play niece with containers. Their is also the >>>>> namespace issue pointed out. I think the namespace issue might be fixed >>>>> in the latest OpenSFS code. >>>> >>>> I'm not sure what you mean? AFAIK, there is no upcall for JobID, except >>>> maybe in the kernel client where we weren't allowed to parse the process >>>> environment directly. I agree an upcall is problematic with namespaces, >>>> in addition to being less functional (only a JobID per node instead of >>>> per process), which is why direct access to JOBENV is better IMHO. >>> >>> I have some evil ideas about this. Need to think about it some more since >>> this is a more complex problem. >> >> Since the kernel manages the environment variables via getenv() and setenv(), I honestly don't see why accessing them directly is a huge issue? > > This is, at best, an over-simplification. The kernel doesn't "manage" the > environment variables. > When a process calls execve() (or similar) a collection of strings called > "arguments" and another collection of strings called "environment" are > extracted from the processes vm, and used for initializing part of the > newly created vm. That is all the kernel does with either. > (except for providing /proc/*/cmdline and /proc/*/environ, which is best-effort). Sure, and we only provide a best effort at parsing it as a series of NUL- terminated strings. Userspace can't corrupt the kernel VMA mappings, so at worst we don't find anything we are looking for, which can also happen if no JobID is set in the first place. It's not really any more dangerous than any copy_from_user() in the filesystem/ioctl code. > getenv() ad setenv() are entirely implemented in user-space. It is quite > possible for a process to mess-up its args or environment in a way that > will make /proc/*/{cmdline,environ} fail to return anything useful. If userspace has also messed it up so badly that it can't parse the environment variables themselves, then even a userspace upcall isn't going to work. > It is quite possible for the memory storing args and env to be swapped > out. If a driver tried to accesses either, it might trigger page-in of > that part of the address space, which would probably work but might not > be a good idea. I've never seen a report of problems like this. Processes that are swapped out are probably not going to be submitting IO either... We cache the JobID in the kernel so it is only fetched on the first IO for that process ID. 
There once was a bug where the JobID was fetched during mmap IO which
caused a deadlock, and was since fixed. We also added the JobID cache,
which has reduced the overhead significantly.

> As I understand it, the goal here is to have a cluster-wide identifier
> that can be attached to groups of processes on different nodes. Then
> stats relating to all of those processes can be collected together.

Correct, but it isn't just _any_ system-wide identifier. The large parallel
MPI applications already get assigned an identifier by the batch scheduler
before they are run, and a large number of tools in these systems use JobID
for tracking logs, CPU/IO accounting, etc. The JobID is stored in an
environment variable (e.g. SLURM_JOB_ID) by the batch scheduler before the
actual job is forked. See the comment at the start of
lustre/obdclass/lprocfs_jobstats.c for examples. We can also set artificial
jobid values for debugging or use with systems not using MPI
(e.g. procname_uid), but they do not need access to the process
environment. For Lustre, the admin does a one-time configuration of the
name of the environment variable ("lctl conf_param jobid_var=SLURM_JOB_ID")
to tell the kernel which environment variable to use.

> ... But as I do think that control-groups are an abomination, I couldn't
> possible suggest any such thing.
> Unix already has a perfectly good grouping abstraction - process groups
> (unfortunately there are about 3 sorts of these, but that needn't be a
> big problem). Stats can be collected based on pgid, and a mapping from
> client+pgid->jobid can be communicated to whatever collects the
> statistics ... somehow.

So, right now we have "scan a few KB of kernel memory for a string"
periodically in the out-of-tree client (see jobid_get_from_environ() and
cfs_get_environ()), and then a hash table that caches the JobID internally
and maps the pid to the JobID when it is needed. Most of the code is a
simplified copy of access_process_vm() for kernels after
v2.6.24-rc1-652-g02c3530da6b9 when it was un-EXPORT_SYMBOL'd, but since
kernel v4.9-rc3-36-gfcd35857d662 it is again exported so it makes sense to
add a configure check. Most of the rest is for when the variable or value
crosses a page boundary.

Conversely, the kernel client has something like "upcall a userspace
process, fork a program (millions of cycles), have that program do the
same scan of the kernel environment memory, but now it is doing it in
userspace, open a file, write the environment variable to the kernel, exit
and clean up the process that was created" to do the same thing.

Using a pgid seems mostly unusable, since the stats are not collected on
the client, they are collected on the server (the JobID is sent with every
userspace-driven RPC to the server), which is the centralized location
where all clients submit their IO. JobStats gives us a relatively easy and
direct method to see which client process(es) are doing a lot of IO or
RPCs, just looking into a /proc file if necessary (though they are
typically further centralized and monitored from the multiple servers).

We can't send a different pgid from each client along with the RPCs and
hope to aggregate that at the server without adding huge complexity. We
would need real-time mapping from every new pgid on each client (maybe
thousands per second per client) to the JobID then passed to the MDS/OSS
so that they can reverse-map the pgid back into a JobID before the first
RPC arrives at the server. Alternately, track separate stats for each
client:pgid combination on the server (num_cores * clients = millions of
times more than today if there are multiple jobs per client) until they
are fetched into userspace for mapping and re-aggregation. Thanks, but I'd
rather stick with the relatively simple and direct method we are using
today. It's worked without problems for 10 years of kernels.

I think one of the big obstacles that we face with many of the upstream
kernel maintainers is that they are focussed on issues that are local to
one or a few nodes, but we have to deal with issues that may involve
hundreds or thousands of different nodes working as a single task (unlike
cloud stuff where there may be many nodes, but they are all doing things
independently). It's not that we develop crazy things because we have
spare time to burn, but because they are needed to deal sanely with such
environments.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
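[To illustrate the environment-variable lookup described above, here is a userspace approximation using /proc/<pid>/environ; it is only a sketch of the string scan, not the in-kernel cfs_get_environ() code, which does the equivalent walk over the target process's memory and caches the result per pid.]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Scan /proc/<pid>/environ for "<jobid_var>=" and return a copy of the value. */
static char *jobid_from_pid(pid_t pid, const char *jobid_var)
{
	char path[64], *entry = NULL, *jobid = NULL;
	size_t buflen = 0, varlen = strlen(jobid_var);
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/environ", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return NULL;

	/* the file is a sequence of NUL-terminated "NAME=value" strings */
	while (getdelim(&entry, &buflen, '\0', f) != -1) {
		if (strncmp(entry, jobid_var, varlen) == 0 &&
		    entry[varlen] == '=') {
			jobid = strdup(entry + varlen + 1);
			break;
		}
	}
	free(entry);
	fclose(f);
	return jobid;
}

int main(void)
{
	char *jobid = jobid_from_pid(getpid(), "SLURM_JOB_ID");

	printf("jobid: %s\n", jobid ? jobid : "<none>");
	free(jobid);
	return 0;
}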
diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ver.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ver.h
index 1428fdd..e7a2eda 100644
--- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ver.h
+++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ver.h
@@ -2,10 +2,10 @@
 #define _LUSTRE_VER_H_
 
 #define LUSTRE_MAJOR 2
-#define LUSTRE_MINOR 8
+#define LUSTRE_MINOR 9
 #define LUSTRE_PATCH 99
 #define LUSTRE_FIX 0
-#define LUSTRE_VERSION_STRING "2.8.99"
+#define LUSTRE_VERSION_STRING "2.9.99"
 
 #define OBD_OCD_VERSION(major, minor, patch, fix) \
 	(((major) << 24) + ((minor) << 16) + ((patch) << 8) + (fix))
With the majority of missing patches and features from the lustre
2.10 release merged upstream it's time to update the upstream
client's version.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/include/uapi/linux/lustre/lustre_ver.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
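[Illustration only: the OBD_OCD_VERSION() macro touched by this patch packs a version into one comparable integer, which is what the client/server version and feature checks discussed earlier in the thread compare against. LUSTRE_VERSION_CODE is assumed to be defined from the four components in the same header, and the include path below is illustrative.]

/* guard code on the minimum Lustre version it expects */
#include <linux/lustre/lustre_ver.h>	/* illustrative path */

#if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(2, 9, 99, 0)
#error "this code expects the 2.9.99 development version or later"
#endif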