Message ID | 155053494716.24125.12637812266850199400.stgit@noble.brown (mailing list archive) |
---|---|
State | New, archived |
Series | More lustre patches from obdclass |
On Feb 18, 2019, at 17:09, NeilBrown <neilb@suse.com> wrote:
>
> The jobid_var sysfs attribute only has 3 meaningful values.
> Other values cause lustre_get_jobid() to return an error
> which is uniformly ignored.
>
> To improve usability and resilience, check that the value
> written is acceptable before storing it.
>
> Signed-off-by: NeilBrown <neilb@suse.com>

This will no longer be true once https://review.whamcloud.com/31691
commit 6488c0ec57de ("LU-10698 obdclass: allow specifying complex jobids")
is merged into your tree.  That allows specifying more useful jobid values,
similar to how coredump files can be named:

    Allow specifying a format string for the jobid_name variable to
    create a jobid for processes on the client.  The jobid_name is used
    when jobid_var=nodelocal, if jobid_name contains "%j", or as a
    fallback if getting the specified jobid_var from the environment
    fails.

    The jobid_name string allows the following escape sequences:
        %e = executable name
        %g = group ID
        %h = hostname (system utsname)
        %j = jobid from jobid_var environment variable
        %p = process ID
        %u = user ID

    Any unknown escape sequences are dropped.  Other arbitrary
    characters pass through unmodified, up to the maximum jobid string
    size of 32, though whitespace within the jobid is not copied.

    This allows, for example, specifying an arbitrary prefix, such as
    the cluster name, in addition to the traditional "procname.uid"
    format, to distinguish between jobs running on clients in different
    clusters:

        lctl set_param jobid_var=nodelocal jobid_name=cluster2.%e.%u
    or
        lctl set_param jobid_var=SLURM_JOB_ID jobid_name=cluster2.%j.%e

    To use an environment-specified JobID, if available, but fall back
    to a static string for all processes that do not have a valid JobID:

        lctl set_param jobid_var=SLURM_JOB_ID jobid_name=unknown

The "%j" functionality was removed from the kernel client, even though
there is no technical reason it can't work (i.e. all of the code to
implement it is available and exported).
This is actually super useful for HPC cluster administrators to monitor
per-job IO bandwidth and IOPS on the server, and something that I think
should be restored.

Cheers, Andreas

> ---
>  drivers/staging/lustre/lustre/obdclass/obd_sysfs.c | 21 ++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c b/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
> index 57a6f2c2da1c..69ccc6a55947 100644
> --- a/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
> +++ b/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
> @@ -216,16 +216,25 @@ static ssize_t jobid_var_store(struct kobject *kobj, struct attribute *attr,
>  			       const char *buffer,
>  			       size_t count)
>  {
> +	static char *valid[] = {
> +		JOBSTATS_DISABLE,
> +		JOBSTATS_PROCNAME_UID,
> +		JOBSTATS_NODELOCAL,
> +		NULL
> +	};
> +	int i;
> +
>  	if (!count || count > JOBSTATS_JOBID_VAR_MAX_LEN)
>  		return -EINVAL;
>
> -	memset(obd_jobid_var, 0, JOBSTATS_JOBID_VAR_MAX_LEN + 1);
> -
> -	memcpy(obd_jobid_var, buffer, count);
> +	for (i = 0; valid[i]; i++)
> +		if (sysfs_streq(buffer, valid[i]))
> +			break;
> +	if (!valid[i])
> +		return -EINVAL;
>
> -	/* Trim the trailing '\n' if any */
> -	if (obd_jobid_var[count - 1] == '\n')
> -		obd_jobid_var[count - 1] = 0;
> +	memset(obd_jobid_var, 0, JOBSTATS_JOBID_VAR_MAX_LEN + 1);
> +	strcpy(obd_jobid_var, valid[i]);
>
>  	return count;
> }

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
On Wed, Feb 27 2019, Andreas Dilger wrote:
> On Feb 18, 2019, at 17:09, NeilBrown <neilb@suse.com> wrote:
>>
>> The jobid_var sysfs attribute only has 3 meaningful values.
>> Other values cause lustre_get_jobid() to return an error
>> which is uniformly ignored.
>>
>> To improve usability and resilience, check that the value
>> written is acceptable before storing it.
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> This will no longer be true once https://review.whamcloud.com/31691
> commit 6488c0ec57de ("LU-10698 obdclass: allow specifying complex jobids")

Actually it will.  That patch changes the use of jobid_name; my patch
restricts the values of jobid_var.

I just realized why it is called "jobid_var" - in OpenSFS lustre, it can
be an environment variable name.  In drivers/staging lustre it cannot,
so the name is a little odd.

> Currently the "%j" function was removed from the kernel client, even
> though there is no technical reason it can't work (i.e. all of the code
> to implement it is available and exported).  This is actually super
> useful for HPC cluster administrators to monitor per-job IO bandwidth
> and IOPS on the server, and something that I think should be restored.

I think that you probably need to let go of that desire - I don't think
it is going to happen.  While the code may, as you say, work - it is
easy to dislike that approach, and it would be hard to push against such
resistance.

I have an alternate approach, patch below.  Instead of

    export LUSTRE_JOBID=foobar

a process can run

    echo foobar > /sys/fs/lustre/jobid_this_session

and it will affect all processes in the current "session".

Could you warm to this approach at all?

Thanks,
NeilBrown

From: NeilBrown <neilb@suse.com>
Subject: [PATCH] lustre: obdclass: allow per-session jobids.

Lustre includes a jobid in all RPC messages sent to the server.  This
is used to collect per-job statistics, where a "job" can involve
multiple processes on multiple nodes in a cluster.
Nodes in a cluster can be running processes for multiple jobs, so it is
best if different processes can have different jobids, and if processes
on different nodes can have the same jobid.  This is not currently
possible with the drivers/staging code.

Lustre traditionally uses an environment variable to name a job, but
having the kernel reach into the address space of a process to find
that environment variable is seen by some developers to be an
unacceptable design choice.

This patch provides an alternate method, leveraging the concept of a
"session id", as set with setsid().  Each login session already gets a
unique sid which is preserved for all processes in that session unless
explicitly changed (with setsid(1)).

When a process in a session writes to
/sys/fs/lustre/jobid_this_session, the string becomes the name for that
session.  If jobid_var is set to "manual", then the per-session jobid
is used as the jobid for all requests from processes in that session.

When a session ends, the jobid information will be purged within 5
minutes.
Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/lustre/include/lprocfs_status.h |   1 +
 drivers/staging/lustre/lustre/include/obd_class.h  |   3 +
 drivers/staging/lustre/lustre/obdclass/class_obd.c | 160 ++++++++++++++++++++-
 drivers/staging/lustre/lustre/obdclass/obd_sysfs.c |  41 ++++++
 4 files changed, 204 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lustre/include/lprocfs_status.h b/drivers/staging/lustre/lustre/include/lprocfs_status.h
index 8565c28f08ee..1335a5722903 100644
--- a/drivers/staging/lustre/lustre/include/lprocfs_status.h
+++ b/drivers/staging/lustre/lustre/include/lprocfs_status.h
@@ -370,6 +370,7 @@ static inline void s2dhms(struct dhms *ts, time64_t secs64)
 #define JOBSTATS_DISABLE	"disable"
 #define JOBSTATS_PROCNAME_UID	"procname_uid"
 #define JOBSTATS_NODELOCAL	"nodelocal"
+#define JOBSTATS_MANUAL		"manual"
 
 /* obd_config.c */
 void lustre_register_client_process_config(int (*cpc)(struct lustre_cfg *lcfg));
diff --git a/drivers/staging/lustre/lustre/include/obd_class.h b/drivers/staging/lustre/lustre/include/obd_class.h
index 50b08c89ecc5..08003f3dd467 100644
--- a/drivers/staging/lustre/lustre/include/obd_class.h
+++ b/drivers/staging/lustre/lustre/include/obd_class.h
@@ -55,6 +55,9 @@ extern rwlock_t obd_dev_lock;
 struct obd_device *class_exp2obd(struct obd_export *exp);
 int class_handle_ioctl(unsigned int cmd, unsigned long arg);
 int lustre_get_jobid(char *jobid);
+char *jobid_current(void);
+int jobid_set_current(char *jobid);
+
 
 struct lu_device_type;
 
diff --git a/drivers/staging/lustre/lustre/obdclass/class_obd.c b/drivers/staging/lustre/lustre/obdclass/class_obd.c
index 1fcbda128a58..19ce3c858e59 100644
--- a/drivers/staging/lustre/lustre/obdclass/class_obd.c
+++ b/drivers/staging/lustre/lustre/obdclass/class_obd.c
@@ -79,6 +79,144 @@ EXPORT_SYMBOL(at_extra);
 char obd_jobid_var[JOBSTATS_JOBID_VAR_MAX_LEN + 1] = JOBSTATS_DISABLE;
 char obd_jobid_node[LUSTRE_JOBID_SIZE + 1];
 
+/*
+ * Jobid can be set for a session (see setsid(2)) by writing to
+ * a sysfs file from any process in that session.
+ * The jobids are stored in a hash table indexed by the relevant
+ * struct pid.  We periodically look for entries where the pid has
+ * no PIDTYPE_SID tasks any more, and prune them.  This happens within
+ * 5 seconds of a jobid being added, and every 5 minutes when jobids exist,
+ * but none are added.
+ */
+#define JOBID_EXPEDITED_CLEAN	(5 * HZ)
+#define JOBID_BACKGROUND_CLEAN	(5 * 60 * HZ)
+
+struct session_jobid {
+	struct pid		*session;
+	struct rhash_head	linkage;
+	struct rcu_head		rcu;
+	char			jobid[1];
+};
+
+const static struct rhashtable_params jobid_params = {
+	.key_len	= sizeof(struct pid *),
+	.key_offset	= offsetof(struct session_jobid, session),
+	.head_offset	= offsetof(struct session_jobid, linkage),
+};
+static struct rhashtable session_jobids;
+
+/*
+ * jobid_current must be called with rcu_read_lock held.
+ * if it returns non-NULL, the string can only be used
+ * until rcu_read_unlock is called.
+ */
+char *jobid_current(void)
+{
+	struct pid *sid = current->signal->pids[PIDTYPE_SID];
+	struct session_jobid *sj;
+
+	sj = rhashtable_lookup_fast(&session_jobids, &sid, jobid_params);
+	if (sj)
+		return sj->jobid;
+	return NULL;
+}
+
+static void jobid_prune_expedite(void);
+/*
+ * jobid_set_current will try to add a new entry
+ * to the table.  If one exists with the same key, the
+ * jobid will be replaced
+ */
+int jobid_set_current(char *jobid)
+{
+	struct pid *sid = current->signal->pids[PIDTYPE_SID];
+	struct session_jobid *sj, *origsj;
+	int ret;
+
+	sj = kmalloc(sizeof(*sj) + strlen(jobid), GFP_KERNEL);
+	if (!sj)
+		return -ENOMEM;
+	rcu_read_lock();
+	sj->session = get_pid(sid);
+	strcpy(sj->jobid, jobid);
+	origsj = rhashtable_lookup_get_insert_fast(&session_jobids,
+						   &sj->linkage,
+						   jobid_params);
+	if (origsj == NULL) {
+		/* successful insert */
+		rcu_read_unlock();
+		jobid_prune_expedite();
+		return 0;
+	}
+
+	if (IS_ERR(origsj)) {
+		put_pid(sj->session);
+		kfree(sj);
+		rcu_read_unlock();
+		return PTR_ERR(origsj);
+	}
+	ret = rhashtable_replace_fast(&session_jobids,
+				      &origsj->linkage,
+				      &sj->linkage,
+				      jobid_params);
+	if (ret) {
+		put_pid(sj->session);
+		kfree(sj);
+		rcu_read_unlock();
+		return ret;
+	}
+	put_pid(origsj->session);
+	rcu_read_unlock();
+	kfree_rcu(origsj, rcu);
+	jobid_prune_expedite();
+
+	return 0;
+}
+
+static void jobid_free(void *vsj, void *arg)
+{
+	struct session_jobid *sj = vsj;
+	put_pid(sj->session);
+	kfree(sj);
+}
+
+static void jobid_prune(struct work_struct *work);
+static DECLARE_DELAYED_WORK(jobid_prune_work, jobid_prune);
+static int jobid_prune_expedited;
+static void jobid_prune(struct work_struct *work)
+{
+	int remaining = 0;
+	struct rhashtable_iter iter;
+	struct session_jobid *sj;
+
+	jobid_prune_expedited = 0;
+	rhashtable_walk_enter(&session_jobids, &iter);
+	rhashtable_walk_start(&iter);
+	while ((sj = rhashtable_walk_next(&iter)) != NULL) {
+		if (!hlist_empty(&sj->session->tasks[PIDTYPE_SID])) {
+			remaining++;
+			continue;
+		}
+		if (rhashtable_remove_fast(&session_jobids,
+					   &sj->linkage, jobid_params) == 0) {
+			put_pid(sj->session);
+			kfree_rcu(sj, rcu);
+		}
+	}
+	rhashtable_walk_stop(&iter);
+	rhashtable_walk_exit(&iter);
+	if (remaining)
+		schedule_delayed_work(&jobid_prune_work, JOBID_BACKGROUND_CLEAN);
+}
+
+static void jobid_prune_expedite(void)
+{
+	if (!jobid_prune_expedited) {
+		jobid_prune_expedited = 1;
+		mod_delayed_work(system_wq, &jobid_prune_work, JOBID_EXPEDITED_CLEAN);
+	}
+}
+
 /* Get jobid of current process from stored variable or calculate
  * it from pid and user_id.
  *
@@ -108,6 +246,17 @@ int lustre_get_jobid(char *jobid)
 		goto out_cache_jobid;
 	}
 
+	if (strcmp(obd_jobid_var, JOBSTATS_MANUAL) == 0) {
+		char *jid;
+		rcu_read_lock();
+		jid = jobid_current();
+		if (jid)
+			strlcpy(tmp_jobid, jid, sizeof(tmp_jobid));
+		rcu_read_unlock();
+		if (jid)
+			goto out_cache_jobid;
+	}
+
 	return -ENOENT;
 
 out_cache_jobid:
@@ -663,10 +812,13 @@ static int __init obdclass_init(void)
 	if (err)
 		goto cleanup_zombie_impexp;
 
+	err = rhashtable_init(&session_jobids, &jobid_params);
+	if (err)
+		goto cleanup_class_handle;
 	err = misc_register(&obd_psdev);
 	if (err) {
 		CERROR("cannot register OBD miscdevices: err %d\n", err);
-		goto cleanup_class_handle;
+		goto cleanup_session_jobids;
 	}
 
 	/* Default the dirty page cache cap to 1/2 of system memory.
@@ -724,6 +876,9 @@ static int __init obdclass_init(void)
 cleanup_deregister:
 	misc_deregister(&obd_psdev);
 
+cleanup_session_jobids:
+	rhashtable_free_and_destroy(&session_jobids, jobid_free, NULL);
+
 cleanup_class_handle:
 	class_handle_cleanup();
 
@@ -743,6 +898,9 @@ static void obdclass_exit(void)
 	cl_global_fini();
 	lu_global_fini();
 
+	cancel_delayed_work_sync(&jobid_prune_work);
+	rhashtable_free_and_destroy(&session_jobids, jobid_free, NULL);
+
 	obd_cleanup_caches();
 
 	class_procfs_clean();
diff --git a/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c b/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
index 69ccc6a55947..112782e56793 100644
--- a/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
+++ b/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
@@ -220,6 +220,7 @@ static ssize_t jobid_var_store(struct kobject *kobj, struct attribute *attr,
 		JOBSTATS_DISABLE,
 		JOBSTATS_PROCNAME_UID,
 		JOBSTATS_NODELOCAL,
+		JOBSTATS_MANUAL,
 		NULL
 	};
 	int i;
@@ -263,6 +264,44 @@ static ssize_t jobid_name_store(struct kobject *kobj, struct attribute *attr,
 	return count;
 }
 
+static ssize_t jobid_this_session_show(struct kobject *kobj,
+				       struct attribute *attr,
+				       char *buf)
+{
+	char *jid;
+	int ret = -ENOENT;
+
+	rcu_read_lock();
+	jid = jobid_current();
+	if (jid)
+		ret = snprintf(buf, PAGE_SIZE, "%s\n", jid);
+	rcu_read_unlock();
+	return ret;
+}
+
+static ssize_t jobid_this_session_store(struct kobject *kobj,
+					struct attribute *attr,
+					const char *buffer,
+					size_t count)
+{
+	char *jobid;
+	int len;
+	int ret;
+
+	if (!count || count > LUSTRE_JOBID_SIZE)
+		return -EINVAL;
+
+	jobid = kstrndup(buffer, count, GFP_KERNEL);
+	if (!jobid)
+		return -ENOMEM;
+	len = strcspn(jobid, " \n");
+	jobid[len] = '\0';
+	ret = jobid_set_current(jobid);
+	kfree(jobid);
+
+	return ret ?: count;
+}
+
 /* Root for /sys/kernel/debug/lustre */
 struct dentry *debugfs_lustre_root;
 EXPORT_SYMBOL_GPL(debugfs_lustre_root);
@@ -272,6 +311,7 @@ LUSTRE_RO_ATTR(pinger);
 LUSTRE_RO_ATTR(health_check);
 LUSTRE_RW_ATTR(jobid_var);
 LUSTRE_RW_ATTR(jobid_name);
+LUSTRE_RW_ATTR(jobid_this_session);
 
 static struct attribute *lustre_attrs[] = {
 	&lustre_attr_version.attr,
@@ -279,6 +319,7 @@ static struct attribute *lustre_attrs[] = {
 	&lustre_attr_health_check.attr,
 	&lustre_attr_jobid_name.attr,
 	&lustre_attr_jobid_var.attr,
+	&lustre_attr_jobid_this_session.attr,
 	&lustre_sattr_timeout.u.attr,
 	&lustre_attr_max_dirty_mb.attr,
 	&lustre_sattr_debug_peer_on_timeout.u.attr,
On Feb 28, 2019, at 19:35, NeilBrown <neilb@suse.com> wrote:
>
> On Wed, Feb 27 2019, Andreas Dilger wrote:
>
>> On Feb 18, 2019, at 17:09, NeilBrown <neilb@suse.com> wrote:
>>>
>>> The jobid_var sysfs attribute only has 3 meaningful values.
>>> Other values cause lustre_get_jobid() to return an error
>>> which is uniformly ignored.
>>>
>>> To improve usability and resilience, check that the value
>>> written is acceptable before storing it.
>>>
>>> Signed-off-by: NeilBrown <neilb@suse.com>
>>
>> This will no longer be true once https://review.whamcloud.com/31691
>> commit 6488c0ec57de ("LU-10698 obdclass: allow specifying complex jobids")
>
> Actually it will.  That patch changes the use of jobid_name; my patch
> restricts the values of jobid_var.
>
> I just realized why it is called "jobid_var" - in OpenSFS lustre, it can
> be an environment variable name.  In drivers/staging lustre it cannot,
> so the name is a little odd.
>
>> Currently the "%j" function was removed from the kernel client, even
>> though there is no technical reason it can't work (i.e. all of the code
>> to implement it is available and exported).  This is actually super
>> useful for HPC cluster administrators to monitor per-job IO bandwidth
>> and IOPS on the server, and something that I think should be restored.
>
> I think that you probably need to let go of that desire - I don't think
> it is going to happen.  While the code may, as you say, work - it is
> easy to dislike that approach, and it would be hard to push against such
> resistance.
>
> I have an alternate approach, patch below.  Instead of
>     export LUSTRE_JOBID=foobar
> a process can run
>     echo foobar > /sys/fs/lustre/jobid_this_session
>
> and it will affect all processes in the current "session".
>
> Could you warm to this approach at all?

This is the best alternative that I've seen so far.  That said, the choice
of storing the JobID as an environment variable wasn't something that we
invented ourselves; rather, this is what the various job schedulers do
when a job is first started, and the code to extract the JobID out of the
process environment was what we implemented to work within the constraints
that were given to us.  The benefit of course is that we don't depend on
some external process to be run for every process to extract the
environment variable from the kernel memory and send it back to us.

Allowing different processes to have different JobIDs on a single node is
definitely a requirement for most environments these days, and so far this
is the only solution that has allowed that to work.

James is more familiar with the production side of the house, so I'll let
him comment on how easy/hard it would be to get this kind of change added
to the per-job preamble script so that it is set for all processes.

Cheers, Andreas

> From: NeilBrown <neilb@suse.com>
> Subject: [PATCH] lustre: obdclass: allow per-session jobids.
>
> [full patch quoted above; snipped]

Cheers, Andreas
---
Andreas Dilger
CTO Whamcloud
The definition of "session" matters here.  It is increasingly common to
run multiple jobs on a node at the same time, and the desire to
distinguish between them is also significant.  So we can't have just one
session.

I believe this is part of why it was chosen to put this information in
user space, associated with a particular process.

So we need a solution that can meet that requirement or the broader HPC
world will just ignore us.

Perhaps we could use the new jobid cache functionality here.  A sketch:

Have the parent - I believe there would be a unique parent for every set
of job processes - register the ID (via this proc variable) and we would
apply it to all children (what about grandchildren, etc...).  Then set
timeouts to something like 24 or 48 hours, but also use (or add, if it's
not present) the functionality to deregister job ids at job exit, so the
timeout is only for cleanup.

Although, the more I talk about this, the more this feels like something
that should live in task_struct, be set from user space, and managed by
the kernel...
Hi Patrick,
I agree that "something that should live in task_struct" is the sort of
answer that would be nice, but I don't think it can be justified.

A "session" in the sense of getsid() and setsid() is traditionally a
login session, but since the invention of windowing systems it is one
terminal window.  So if you

    grep NSsid /proc/self/status

in different windows, you will see different session ids.  All processes
in that window will have the same session id unless something explicitly
creates a new session - few things do that.  If you want to explicitly
create a new session, you can use "setsid".  So

    setsid grep NSsid /proc/self/status

will always show a new session id.

Considering your sketch, the "unique parent" is already very likely to
be in a session that no other job is in.  If you run multiple jobs from
the one terminal window (is that likely?) then they will all be in the
same session unless you run

    setsid run-my-job

The session (and anything we associate with it) will apply to all
children and grandchildren etc.

There is no need for a timeout beyond what the code already has - it
checks every 5 minutes to see if the session is still active, and
forgets any inactive session.  There is no risk of a new session being
confused with an old session that no longer exists, even if they have
the same number.

I hope that clarifies things for you.

Thanks,
NeilBrown

On Fri, Mar 01 2019, Patrick Farrell wrote:
> The definition of "session" matters here.  It is increasingly common
> to run multiple jobs on a node at the same time, and the desire to
> distinguish between them is also significant.  So we can't have just
> one session.
>
> I believe this is part of why it was chosen to put this information in
> user space, associated with a particular process.
>
> So we need a solution that can meet that requirement or the broader
> HPC world will just ignore us.
>
> Perhaps we could use the new jobid cache functionality here.
A sketch: > > Have the parent - I believe there would be a unique parent for every set of job processes - register the ID (via this proc variable) and we would apply it to all children (what about grandchildren, etc...). Then set timeouts to something like 24 or 48 hours, and but also use (or add if it's not present?) the functionality to deregister job ids at job exit. So the timeout is only for cleanup. > > Although, the more I talk about this, the more this feels like something that should live in struct_task and be set from user space and managed by the kernel... > ________________________________ > From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Andreas Dilger <adilger@whamcloud.com> > Sent: Friday, March 1, 2019 2:32:00 AM > To: NeilBrown > Cc: Lustre Development List > Subject: Re: [lustre-devel] [PATCH 37/37] lustre: obd_sysfs: error-check value stored in jobid_var > > On Feb 28, 2019, at 19:35, NeilBrown <neilb@suse.com> wrote: >> >> On Wed, Feb 27 2019, Andreas Dilger wrote: >> >>> On Feb 18, 2019, at 17:09, NeilBrown <neilb@suse.com> wrote: >>>> >>>> The jobid_var sysfs attribute only has 3 meaningful values. >>>> Other values cause lustre_get_jobid() to return an error >>>> which is uniformly ignored. >>>> >>>> To improve usability and resilience, check that the value >>>> written is acceptable before storing it. >>>> >>>> Signed-off-by: NeilBrown <neilb@suse.com> >>> >>> This will no longer be true once https://review.whamcloud.com/31691 >>> commit 6488c0ec57de ("LU-10698 obdclass: allow specifying complex jobids") >> >> Actually it will. That patch changes the use of jobid_name, my patch >> restricts the values of jobid_var. >> >> I just realized why it is called "jobid_var" - in OpenSFS lustre, it can >> be an environment variable name. In drivers/staging lustre it cannot, >> so the name is a little odd. 
>> >>> >>> Currently the "%j" function was removed from the kernel client, even >>> though there is no technical reason it can't work (i.e. all of the code >>> to implement it is available and exported). This is actually super >>> useful for HPC cluster administrators to monitor per-job IO bandwidth >>> and IOPS on the server, and something that I think should be restored. >> >> I think that you probably need to let go of that desire - I don't think >> it is going to happen. While the code may, as you say, work - it is >> easy to dislike that approach, and would be hard to push against such >> resistance. >> >> I have an alternate approach, patch below. Instead of >> export LUSTRE_JOBID=foobar >> and process can run >> echo foobar > /sys/fs/lustre/jobid_this_session >> >> and it will affect all processes in the current "session". >> >> Could you warm to this approach at all? > > This is the best alternative that I've seen so far. That said, the choice > of storing the JobID as an environment variable wasn't something that we > invented ourselves, but rather this is what the various job schedulers do > when a job is first started, and the code to extract the JobID out of the > process environment was what we implemented to work within the constraints > that were given to us. The benefit of course is that we don't depend on > some external process to be run for every process to extract the environment > variable from the kernel memory and send it back to us. > > Allowing different processes to have different JobIDs on a single node is > definitely a requirement for most environments these days, and so far this > is the only solution that has allowed that to work. > > James is more familiar with the production side of the house, so I'll let > him comment on how easy/hard it would be to get this kind of change added > to the per-job preamble script so that it is set for all processes. 
>
> Cheers, Andreas
>
>> From: NeilBrown <neilb@suse.com>
>> Subject: [PATCH] lustre: obdclass: allow per-session jobids.
>>
>> Lustre includes a jobid in all RPC messages sent to the server. This
>> is used to collect per-job statistics, where a "job" can involve
>> multiple processes on multiple nodes in a cluster.
>>
>> Nodes in a cluster can be running processes for multiple jobs, so it
>> is best if different processes can have different jobids, and that
>> processes on different nodes can have the same job id.
>>
>> This is not currently possible with the drivers/staging code.
>>
>> Lustre traditionally uses an environment variable to name a job, but
>> having the kernel reach into the address space of a process to find
>> that environment variable is seen by some developers to be an
>> unacceptable design choice.
>>
>> This patch provides an alternate method, leveraging the concept of a
>> "session id", as set with setsid(). Each login session already gets a
>> unique sid which is preserved for all processes in that session unless
>> explicitly changed (with setsid(1)).
>> When a process in a session writes to
>> /sys/fs/lustre/jobid_this_session, the string becomes the name for
>> that session.
>> If jobid_var is set to "manual", then the per-session jobid is used
>> for the jobid for all requests from processes in that session.
>>
>> When a session ends, the jobid information will be purged within 5
>> minutes.
>> >> Signed-off-by: NeilBrown <neilb@suse.com> >> --- >> .../staging/lustre/lustre/include/lprocfs_status.h | 1 + >> drivers/staging/lustre/lustre/include/obd_class.h | 3 + >> drivers/staging/lustre/lustre/obdclass/class_obd.c | 160 ++++++++++++++++++++- >> drivers/staging/lustre/lustre/obdclass/obd_sysfs.c | 41 ++++++ >> 4 files changed, 204 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/staging/lustre/lustre/include/lprocfs_status.h b/drivers/staging/lustre/lustre/include/lprocfs_status.h >> index 8565c28f08ee..1335a5722903 100644 >> --- a/drivers/staging/lustre/lustre/include/lprocfs_status.h >> +++ b/drivers/staging/lustre/lustre/include/lprocfs_status.h >> @@ -370,6 +370,7 @@ static inline void s2dhms(struct dhms *ts, time64_t secs64) >> #define JOBSTATS_DISABLE "disable" >> #define JOBSTATS_PROCNAME_UID "procname_uid" >> #define JOBSTATS_NODELOCAL "nodelocal" >> +#define JOBSTATS_MANUAL "manual" >> >> /* obd_config.c */ >> void lustre_register_client_process_config(int (*cpc)(struct lustre_cfg *lcfg)); >> diff --git a/drivers/staging/lustre/lustre/include/obd_class.h b/drivers/staging/lustre/lustre/include/obd_class.h >> index 50b08c89ecc5..08003f3dd467 100644 >> --- a/drivers/staging/lustre/lustre/include/obd_class.h >> +++ b/drivers/staging/lustre/lustre/include/obd_class.h >> @@ -55,6 +55,9 @@ extern rwlock_t obd_dev_lock; >> struct obd_device *class_exp2obd(struct obd_export *exp); >> int class_handle_ioctl(unsigned int cmd, unsigned long arg); >> int lustre_get_jobid(char *jobid); >> +char *jobid_current(void); >> +int jobid_set_current(char *jobid); >> + >> >> struct lu_device_type; >> >> diff --git a/drivers/staging/lustre/lustre/obdclass/class_obd.c b/drivers/staging/lustre/lustre/obdclass/class_obd.c >> index 1fcbda128a58..19ce3c858e59 100644 >> --- a/drivers/staging/lustre/lustre/obdclass/class_obd.c >> +++ b/drivers/staging/lustre/lustre/obdclass/class_obd.c >> @@ -79,6 +79,144 @@ EXPORT_SYMBOL(at_extra); >> char 
obd_jobid_var[JOBSTATS_JOBID_VAR_MAX_LEN + 1] = JOBSTATS_DISABLE; >> char obd_jobid_node[LUSTRE_JOBID_SIZE + 1]; >> >> +/* >> + * Jobid can be set for a session (see setsid(2)) by writing to >> + * a sysfs file from any process in that session. >> + * The jobids are stored in a hash table indexed by the relevant >> + * struct pid. We periodically look for entries where the pid has >> + * no PIDTYPE_SID tasks any more, and prune them. This happens within >> + * 5 seconds of a jobid being added, and every 5 minutes when jobids exist, >> + * but none are added. >> + */ >> +#define JOBID_EXPEDITED_CLEAN (5 * HZ) >> +#define JOBID_BACKGROUND_CLEAN (5 * 60 * HZ) >> + >> +struct session_jobid { >> + struct pid *session; >> + struct rhash_head linkage; >> + struct rcu_head rcu; >> + char jobid[1]; >> +}; >> + >> +const static struct rhashtable_params jobid_params = { >> + .key_len = sizeof(struct pid *), >> + .key_offset = offsetof(struct session_jobid, session), >> + .head_offset = offsetof(struct session_jobid, linkage), >> +}; >> +static struct rhashtable session_jobids; >> + >> +/* >> + * jobid_current must be called with rcu_read_lock held. >> + * if it returns non-NULL, the string can only be used >> + * until rcu_read_unlock is called. >> + */ >> +char *jobid_current(void) >> +{ >> + struct pid *sid = current->signal->pids[PIDTYPE_SID]; >> + struct session_jobid *sj; >> + >> + sj = rhashtable_lookup_fast(&session_jobids, &sid, jobid_params); >> + if (sj) >> + return sj->jobid; >> + return NULL; >> +} >> + >> +static void jobid_prune_expedite(void); >> +/* >> + * jobid_set_current will try to add a new entry >> + * to the table. 
If one exists with the same key, the >> + * jobid will be replaced >> + */ >> +int jobid_set_current(char *jobid) >> +{ >> + struct pid *sid = current->signal->pids[PIDTYPE_SID]; >> + struct session_jobid *sj, *origsj; >> + int ret; >> + >> + sj = kmalloc(sizeof(*sj) + strlen(jobid), GFP_KERNEL); >> + if (!sj) >> + return -ENOMEM; >> + rcu_read_lock(); >> + sj->session = get_pid(sid); >> + strcpy(sj->jobid, jobid); >> + origsj = rhashtable_lookup_get_insert_fast(&session_jobids, >> + &sj->linkage, >> + jobid_params); >> + if (origsj == NULL) { >> + /* successful insert */ >> + rcu_read_unlock(); >> + jobid_prune_expedite(); >> + return 0; >> + } >> + >> + if (IS_ERR(origsj)) { >> + put_pid(sj->session); >> + kfree(sj); >> + rcu_read_unlock(); >> + return PTR_ERR(origsj); >> + } >> + ret = rhashtable_replace_fast(&session_jobids, >> + &origsj->linkage, >> + &sj->linkage, >> + jobid_params); >> + if (ret) { >> + put_pid(sj->session); >> + kfree(sj); >> + rcu_read_unlock(); >> + return ret; >> + } >> + put_pid(origsj->session); >> + rcu_read_unlock(); >> + kfree_rcu(origsj, rcu); >> + jobid_prune_expedite(); >> + >> + return 0; >> +} >> + >> +static void jobid_free(void *vsj, void *arg) >> +{ >> + struct session_jobid *sj = vsj; >> + put_pid(sj->session); >> + kfree(sj); >> +} >> + >> +static void jobid_prune(struct work_struct *work); >> +static DECLARE_DELAYED_WORK(jobid_prune_work, jobid_prune); >> +static int jobid_prune_expedited; >> +static void jobid_prune(struct work_struct *work) >> +{ >> + int remaining = 0; >> + struct rhashtable_iter iter; >> + struct session_jobid *sj; >> + >> + jobid_prune_expedited = 0; >> + rhashtable_walk_enter(&session_jobids, &iter); >> + rhashtable_walk_start(&iter); >> + while ((sj = rhashtable_walk_next(&iter)) != NULL) { >> + if (!hlist_empty(&sj->session->tasks[PIDTYPE_SID])) { >> + remaining++; >> + continue; >> + } >> + if (rhashtable_remove_fast(&session_jobids, >> + &sj->linkage, jobid_params) == 0) { >> + 
put_pid(sj->session); >> + kfree_rcu(sj, rcu); >> + } >> + } >> + rhashtable_walk_stop(&iter); >> + rhashtable_walk_exit(&iter); >> + if (remaining) >> + schedule_delayed_work(&jobid_prune_work, JOBID_BACKGROUND_CLEAN); >> +} >> + >> +static void jobid_prune_expedite(void) >> +{ >> + if (!jobid_prune_expedited) { >> + jobid_prune_expedited = 1; >> + mod_delayed_work(system_wq, &jobid_prune_work, JOBID_EXPEDITED_CLEAN); >> + } >> +} >> + >> /* Get jobid of current process from stored variable or calculate >> * it from pid and user_id. >> * >> @@ -108,6 +246,17 @@ int lustre_get_jobid(char *jobid) >> goto out_cache_jobid; >> } >> >> + if (strcmp(obd_jobid_var, JOBSTATS_MANUAL) == 0) { >> + char *jid; >> + rcu_read_lock(); >> + jid = jobid_current(); >> + if (jid) >> + strlcpy(tmp_jobid, jid, sizeof(tmp_jobid)); >> + rcu_read_unlock(); >> + if (jid) >> + goto out_cache_jobid; >> + } >> + >> return -ENOENT; >> >> out_cache_jobid: >> @@ -663,10 +812,13 @@ static int __init obdclass_init(void) >> if (err) >> goto cleanup_zombie_impexp; >> >> + err = rhashtable_init(&session_jobids, &jobid_params); >> + if (err) >> + goto cleanup_class_handle; >> err = misc_register(&obd_psdev); >> if (err) { >> CERROR("cannot register OBD miscdevices: err %d\n", err); >> - goto cleanup_class_handle; >> + goto cleanup_session_jobids; >> } >> >> /* Default the dirty page cache cap to 1/2 of system memory. 
>> @@ -724,6 +876,9 @@ static int __init obdclass_init(void) >> cleanup_deregister: >> misc_deregister(&obd_psdev); >> >> +cleanup_session_jobids: >> + rhashtable_free_and_destroy(&session_jobids, jobid_free, NULL); >> + >> cleanup_class_handle: >> class_handle_cleanup(); >> >> @@ -743,6 +898,9 @@ static void obdclass_exit(void) >> cl_global_fini(); >> lu_global_fini(); >> >> + cancel_delayed_work_sync(&jobid_prune_work); >> + rhashtable_free_and_destroy(&session_jobids, jobid_free, NULL); >> + >> obd_cleanup_caches(); >> >> class_procfs_clean(); >> diff --git a/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c b/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c >> index 69ccc6a55947..112782e56793 100644 >> --- a/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c >> +++ b/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c >> @@ -220,6 +220,7 @@ static ssize_t jobid_var_store(struct kobject *kobj, struct attribute *attr, >> JOBSTATS_DISABLE, >> JOBSTATS_PROCNAME_UID, >> JOBSTATS_NODELOCAL, >> + JOBSTATS_MANUAL, >> NULL >> }; >> int i; >> @@ -263,6 +264,44 @@ static ssize_t jobid_name_store(struct kobject *kobj, struct attribute *attr, >> return count; >> } >> >> +static ssize_t jobid_this_session_show(struct kobject *kobj, >> + struct attribute *attr, >> + char *buf) >> +{ >> + char *jid; >> + int ret = -ENOENT; >> + >> + rcu_read_lock(); >> + jid = jobid_current(); >> + if (jid) >> + ret = snprintf(buf, PAGE_SIZE, "%s\n", jid); >> + rcu_read_unlock(); >> + return ret; >> +} >> + >> +static ssize_t jobid_this_session_store(struct kobject *kobj, >> + struct attribute *attr, >> + const char *buffer, >> + size_t count) >> +{ >> + char *jobid; >> + int len; >> + int ret; >> + >> + if (!count || count > LUSTRE_JOBID_SIZE) >> + return -EINVAL; >> + >> + jobid = kstrndup(buffer, count, GFP_KERNEL); >> + if (!jobid) >> + return -ENOMEM; >> + len = strcspn(jobid, " \n"); >> + jobid[len] = '\0'; >> + ret = jobid_set_current(jobid); >> + kfree(jobid); >> + >> + return ret ?: 
count; >> +} >> + >> /* Root for /sys/kernel/debug/lustre */ >> struct dentry *debugfs_lustre_root; >> EXPORT_SYMBOL_GPL(debugfs_lustre_root); >> @@ -272,6 +311,7 @@ LUSTRE_RO_ATTR(pinger); >> LUSTRE_RO_ATTR(health_check); >> LUSTRE_RW_ATTR(jobid_var); >> LUSTRE_RW_ATTR(jobid_name); >> +LUSTRE_RW_ATTR(jobid_this_session); >> >> static struct attribute *lustre_attrs[] = { >> &lustre_attr_version.attr, >> @@ -279,6 +319,7 @@ static struct attribute *lustre_attrs[] = { >> &lustre_attr_health_check.attr, >> &lustre_attr_jobid_name.attr, >> &lustre_attr_jobid_var.attr, >> + &lustre_attr_jobid_this_session.attr, >> &lustre_sattr_timeout.u.attr, >> &lustre_attr_max_dirty_mb.attr, >> &lustre_sattr_debug_peer_on_timeout.u.attr, >> -- >> 2.14.0.rc0.dirty >> > > Cheers, Andreas > --- > Andreas Dilger > CTO Whamcloud > > > > > _______________________________________________ > lustre-devel mailing list > lustre-devel@lists.lustre.org > http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
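To make the interface proposed in the patch above concrete, usage on a client would look roughly like this. This is a sketch only: the sysfs path and the "manual" value come from the patch itself, the job name is invented, and it obviously needs a Lustre client mount to actually run:

```shell
# Select the per-session mechanism (one of the values jobid_var_store accepts):
echo manual > /sys/fs/lustre/jobid_var

# Name the current session; all processes in this session, and their
# descendants, will tag their RPCs with this string:
echo climate-run-42 > /sys/fs/lustre/jobid_this_session

# Reading the file back reports the jobid of the reader's own session:
cat /sys/fs/lustre/jobid_this_session
```

A process that wants a jobid of its own can first run setsid(1) and then write a different string, since the hash table is keyed by session.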
So can we set session ids? Can we make them the same across many nodes? It doesn't sound like it. That's a basic requirement. We need more than just a "kernel-provided ID which is unique to a process set on a node". The point of jobstats is to be able to collect info on all of a job, and many jobs will span many nodes. (We also have > 1 job per node in other cases.)

- Patrick
On Thu, Mar 14 2019, Patrick Farrell wrote:

> So can we set session ids? Can we make them the same across many
> nodes? It doesn't sound like it. That's a basic requirement. We
> need more than just "kernel provided ID which is unique to process set
> on a node". The point of jobstats is to be able to collect info on
> all of a job, and many jobs will span many nodes. (We also have > 1
> job per node in other cases.)

This is where my patch comes in. You cannot choose the session-id, but my patch allows you to associate an arbitrary string with a given session id. Then all processes in that session will use the string for tagging RPCs.

The current approach is to store the string in the environment, and this gets copied from parent to child.
My approach is to store the string in a hash table in the kernel associated with the session-id. The session-id already gets copied from parent to child.

So the result is the same. An arbitrary string gets associated with all processes in a job.
If multiple sessions, even on multiple hosts, are all in the one job, it is simple to associate the same arbitrary string with all the different session ids.

NeilBrown

>
> - Patrick
>
> ________________________________
> From: NeilBrown <neilb@suse.com>
> Sent: Wednesday, March 13, 2019 7:34:05 PM
> To: Patrick Farrell; Andreas Dilger
> Cc: Lustre Development List
> Subject: Re: [PATCH 37/37] lustre: obd_sysfs: error-check value stored in jobid_var
>
>
> Hi Patrick,
> I agree that "something that should live in struct_task" is the sort of
> answer that would be nice, but I don't think it can be justified.
>
> A "session" in the sense of "getsid" and "setsid" is traditionally a
> login session, but since the invention of windowing systems it is one
> terminal window.
> So if you
> grep NSsid /proc/self/status
> in different windows you will see different session ids.
All processes > in that window will have the same session id unless something > explicitly creates a new session - few things do that. > > If you want to explicitly create a new session, you can use "setsid". > So > setsid grep NSsid /proc/self/status > will always show a new session id. > > Considering you sketch, the "unique parent" is already very likely to > be in a session that no other job is in. > If you run multiple jobs from the one terminal window (is they likely?) > then the will all be in the same session unless you run > setsid run-my-job > > The session (and anything we associated with it) will apply to all > children and grandchild etc. > There is no need for a timeout beyond what the code already has - it > checks every 5 minutes to see if the session is still active, and > forgets any inactive session. There is no risk of a new session being > confused with an old session that no longer exists, even if they have > the same number. > > I hope that clarifies for you. > > Thanks, > NeilBrown > > > On Fri, Mar 01 2019, Patrick Farrell wrote: > >> The definition of "session" matters here. It is increasingly common to run multiple jobs on a node at the same time, and the desire to distinguish between them is also significant. So we can't have just one session. >> >> I believe this is part of why it was chosen to put this information in user space, associated with a particular process. >> >> So we need a solution that can meet that requirement or the broader HPC world will just ignore us. >> >> Perhaps we could use the new jobid cache functionality here. A sketch: >> >> Have the parent - I believe there would be a unique parent for every set of job processes - register the ID (via this proc variable) and we would apply it to all children (what about grandchildren, etc...). Then set timeouts to something like 24 or 48 hours, and but also use (or add if it's not present?) the functionality to deregister job ids at job exit. 
So the timeout is only for cleanup. >> >> Although, the more I talk about this, the more this feels like something that should live in struct_task and be set from user space and managed by the kernel... >> ________________________________ >> From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Andreas Dilger <adilger@whamcloud.com> >> Sent: Friday, March 1, 2019 2:32:00 AM >> To: NeilBrown >> Cc: Lustre Development List >> Subject: Re: [lustre-devel] [PATCH 37/37] lustre: obd_sysfs: error-check value stored in jobid_var >> >> On Feb 28, 2019, at 19:35, NeilBrown <neilb@suse.com> wrote: >>> >>> On Wed, Feb 27 2019, Andreas Dilger wrote: >>> >>>> On Feb 18, 2019, at 17:09, NeilBrown <neilb@suse.com> wrote: >>>>> >>>>> The jobid_var sysfs attribute only has 3 meaningful values. >>>>> Other values cause lustre_get_jobid() to return an error >>>>> which is uniformly ignored. >>>>> >>>>> To improve usability and resilience, check that the value >>>>> written is acceptable before storing it. >>>>> >>>>> Signed-off-by: NeilBrown <neilb@suse.com> >>>> >>>> This will no longer be true once https://review.whamcloud.com/31691 >>>> commit 6488c0ec57de ("LU-10698 obdclass: allow specifying complex jobids") >>> >>> Actually it will. That patch changes the use of jobid_name, my patch >>> restricts the values of jobid_var. >>> >>> I just realized why it is called "jobid_var" - in OpenSFS lustre, it can >>> be an environment variable name. In drivers/staging lustre it cannot, >>> so the name is a little odd. >>> >>>> >>>> Currently the "%j" function was removed from the kernel client, even >>>> though there is no technical reason it can't work (i.e. all of the code >>>> to implement it is available and exported). This is actually super >>>> useful for HPC cluster administrators to monitor per-job IO bandwidth >>>> and IOPS on the server, and something that I think should be restored. 
>>> >>> I think that you probably need to let go of that desire - I don't think >>> it is going to happen. While the code may, as you say, work - it is >>> easy to dislike that approach, and would be hard to push against such >>> resistance. >>> >>> I have an alternate approach, patch below. Instead of >>> export LUSTRE_JOBID=foobar >>> and process can run >>> echo foobar > /sys/fs/lustre/jobid_this_session >>> >>> and it will affect all processes in the current "session". >>> >>> Could you warm to this approach at all? >> >> This is the best alternative that I've seen so far. That said, the choice >> of storing the JobID as an environment variable wasn't something that we >> invented ourselves, but rather this is what the various job schedulers do >> when a job is first started, and the code to extract the JobID out of the >> process environment was what we implemented to work within the constraints >> that were given to us. The benefit of course is that we don't depend on >> some external process to be run for every process to extract the environment >> variable from the kernel memory and send it back to us. >> >> Allowing different processes to have different JobIDs on a single node is >> definitely a requirement for most environments these days, and so far this >> is the only solution that has allowed that to work. >> >> James is more familiar with the production side of the house, so I'll let >> him comment on how easy/hard it would be to get this kind of change added >> to the per-job preamble script so that it is set for all processes. >> >> Cheers, Andreas >> >>> From: NeilBrown <neilb@suse.com> >>> Subject: [PATCH] lustre: obdclass: allow per-session jobids. >>> >>> Lustre includes a jobid in all RPC message sent to the server. This >>> is used to collected per-job statistics, where a "job" can involve >>> multiple processes on multiple nodes in a cluster. 
>>> >>> Nodes in a cluster can be running processes for multiple jobs, so it >>> is best if different processes can have different jobids, and that >>> processes on different nodes can have the same job id. >>> >>> This is not currently possible with the drivers/staging code. >>> >>> Lustre traditionally uses an environment variable to name a job, but >>> having the kernel reach into the address space of a process to find >>> that environment variable is seen by some developers to be an >>> unacceptable design choice. >>> >>> This patch provides an alternate method, leveraging the concept of a >>> "session id", as set with setsid(). Each login session already gets a >>> unique sid which is preserved for all processes in that session unless >>> explicitly changed (with setsid(1)). >>> When a process in a session writes to >>> /sys/fs/lustre/jobid_this_session, the string becomes the name for >>> that session. >>> If jobid_var is set to "manual", then the per-session jobid is used >>> for the jobid for all requests from processes in that session. >>> >>> When a session ends, the jobid information will be purged within 5 >>> minutes. 
>>> >>> Signed-off-by: NeilBrown <neilb@suse.com> >>> --- >>> .../staging/lustre/lustre/include/lprocfs_status.h | 1 + >>> drivers/staging/lustre/lustre/include/obd_class.h | 3 + >>> drivers/staging/lustre/lustre/obdclass/class_obd.c | 160 ++++++++++++++++++++- >>> drivers/staging/lustre/lustre/obdclass/obd_sysfs.c | 41 ++++++ >>> 4 files changed, 204 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/staging/lustre/lustre/include/lprocfs_status.h b/drivers/staging/lustre/lustre/include/lprocfs_status.h >>> index 8565c28f08ee..1335a5722903 100644 >>> --- a/drivers/staging/lustre/lustre/include/lprocfs_status.h >>> +++ b/drivers/staging/lustre/lustre/include/lprocfs_status.h >>> @@ -370,6 +370,7 @@ static inline void s2dhms(struct dhms *ts, time64_t secs64) >>> #define JOBSTATS_DISABLE "disable" >>> #define JOBSTATS_PROCNAME_UID "procname_uid" >>> #define JOBSTATS_NODELOCAL "nodelocal" >>> +#define JOBSTATS_MANUAL "manual" >>> >>> /* obd_config.c */ >>> void lustre_register_client_process_config(int (*cpc)(struct lustre_cfg *lcfg)); >>> diff --git a/drivers/staging/lustre/lustre/include/obd_class.h b/drivers/staging/lustre/lustre/include/obd_class.h >>> index 50b08c89ecc5..08003f3dd467 100644 >>> --- a/drivers/staging/lustre/lustre/include/obd_class.h >>> +++ b/drivers/staging/lustre/lustre/include/obd_class.h >>> @@ -55,6 +55,9 @@ extern rwlock_t obd_dev_lock; >>> struct obd_device *class_exp2obd(struct obd_export *exp); >>> int class_handle_ioctl(unsigned int cmd, unsigned long arg); >>> int lustre_get_jobid(char *jobid); >>> +char *jobid_current(void); >>> +int jobid_set_current(char *jobid); >>> + >>> >>> struct lu_device_type; >>> >>> diff --git a/drivers/staging/lustre/lustre/obdclass/class_obd.c b/drivers/staging/lustre/lustre/obdclass/class_obd.c >>> index 1fcbda128a58..19ce3c858e59 100644 >>> --- a/drivers/staging/lustre/lustre/obdclass/class_obd.c >>> +++ b/drivers/staging/lustre/lustre/obdclass/class_obd.c >>> @@ -79,6 +79,144 @@ 
EXPORT_SYMBOL(at_extra); >>> char obd_jobid_var[JOBSTATS_JOBID_VAR_MAX_LEN + 1] = JOBSTATS_DISABLE; >>> char obd_jobid_node[LUSTRE_JOBID_SIZE + 1]; >>> >>> +/* >>> + * Jobid can be set for a session (see setsid(2)) by writing to >>> + * a sysfs file from any process in that session. >>> + * The jobids are stored in a hash table indexed by the relevant >>> + * struct pid. We periodically look for entries where the pid has >>> + * no PIDTYPE_SID tasks any more, and prune them. This happens within >>> + * 5 seconds of a jobid being added, and every 5 minutes when jobids exist, >>> + * but none are added. >>> + */ >>> +#define JOBID_EXPEDITED_CLEAN (5 * HZ) >>> +#define JOBID_BACKGROUND_CLEAN (5 * 60 * HZ) >>> + >>> +struct session_jobid { >>> + struct pid *session; >>> + struct rhash_head linkage; >>> + struct rcu_head rcu; >>> + char jobid[1]; >>> +}; >>> + >>> +const static struct rhashtable_params jobid_params = { >>> + .key_len = sizeof(struct pid *), >>> + .key_offset = offsetof(struct session_jobid, session), >>> + .head_offset = offsetof(struct session_jobid, linkage), >>> +}; >>> +static struct rhashtable session_jobids; >>> + >>> +/* >>> + * jobid_current must be called with rcu_read_lock held. >>> + * if it returns non-NULL, the string can only be used >>> + * until rcu_read_unlock is called. >>> + */ >>> +char *jobid_current(void) >>> +{ >>> + struct pid *sid = current->signal->pids[PIDTYPE_SID]; >>> + struct session_jobid *sj; >>> + >>> + sj = rhashtable_lookup_fast(&session_jobids, &sid, jobid_params); >>> + if (sj) >>> + return sj->jobid; >>> + return NULL; >>> +} >>> + >>> +static void jobid_prune_expedite(void); >>> +/* >>> + * jobid_set_current will try to add a new entry >>> + * to the table. 
If one exists with the same key, the >>> + * jobid will be replaced >>> + */ >>> +int jobid_set_current(char *jobid) >>> +{ >>> + struct pid *sid = current->signal->pids[PIDTYPE_SID]; >>> + struct session_jobid *sj, *origsj; >>> + int ret; >>> + >>> + sj = kmalloc(sizeof(*sj) + strlen(jobid), GFP_KERNEL); >>> + if (!sj) >>> + return -ENOMEM; >>> + rcu_read_lock(); >>> + sj->session = get_pid(sid); >>> + strcpy(sj->jobid, jobid); >>> + origsj = rhashtable_lookup_get_insert_fast(&session_jobids, >>> + &sj->linkage, >>> + jobid_params); >>> + if (origsj == NULL) { >>> + /* successful insert */ >>> + rcu_read_unlock(); >>> + jobid_prune_expedite(); >>> + return 0; >>> + } >>> + >>> + if (IS_ERR(origsj)) { >>> + put_pid(sj->session); >>> + kfree(sj); >>> + rcu_read_unlock(); >>> + return PTR_ERR(origsj); >>> + } >>> + ret = rhashtable_replace_fast(&session_jobids, >>> + &origsj->linkage, >>> + &sj->linkage, >>> + jobid_params); >>> + if (ret) { >>> + put_pid(sj->session); >>> + kfree(sj); >>> + rcu_read_unlock(); >>> + return ret; >>> + } >>> + put_pid(origsj->session); >>> + rcu_read_unlock(); >>> + kfree_rcu(origsj, rcu); >>> + jobid_prune_expedite(); >>> + >>> + return 0; >>> +} >>> + >>> +static void jobid_free(void *vsj, void *arg) >>> +{ >>> + struct session_jobid *sj = vsj; >>> + put_pid(sj->session); >>> + kfree(sj); >>> +} >>> + >>> +static void jobid_prune(struct work_struct *work); >>> +static DECLARE_DELAYED_WORK(jobid_prune_work, jobid_prune); >>> +static int jobid_prune_expedited; >>> +static void jobid_prune(struct work_struct *work) >>> +{ >>> + int remaining = 0; >>> + struct rhashtable_iter iter; >>> + struct session_jobid *sj; >>> + >>> + jobid_prune_expedited = 0; >>> + rhashtable_walk_enter(&session_jobids, &iter); >>> + rhashtable_walk_start(&iter); >>> + while ((sj = rhashtable_walk_next(&iter)) != NULL) { >>> + if (!hlist_empty(&sj->session->tasks[PIDTYPE_SID])) { >>> + remaining++; >>> + continue; >>> + } >>> + if 
(rhashtable_remove_fast(&session_jobids, >>> + &sj->linkage, jobid_params) == 0) { >>> + put_pid(sj->session); >>> + kfree_rcu(sj, rcu); >>> + } >>> + } >>> + rhashtable_walk_stop(&iter); >>> + rhashtable_walk_exit(&iter); >>> + if (remaining) >>> + schedule_delayed_work(&jobid_prune_work, JOBID_BACKGROUND_CLEAN); >>> +} >>> + >>> +static void jobid_prune_expedite(void) >>> +{ >>> + if (!jobid_prune_expedited) { >>> + jobid_prune_expedited = 1; >>> + mod_delayed_work(system_wq, &jobid_prune_work, JOBID_EXPEDITED_CLEAN); >>> + } >>> +} >>> + >>> /* Get jobid of current process from stored variable or calculate >>> * it from pid and user_id. >>> * >>> @@ -108,6 +246,17 @@ int lustre_get_jobid(char *jobid) >>> goto out_cache_jobid; >>> } >>> >>> + if (strcmp(obd_jobid_var, JOBSTATS_MANUAL) == 0) { >>> + char *jid; >>> + rcu_read_lock(); >>> + jid = jobid_current(); >>> + if (jid) >>> + strlcpy(tmp_jobid, jid, sizeof(tmp_jobid)); >>> + rcu_read_unlock(); >>> + if (jid) >>> + goto out_cache_jobid; >>> + } >>> + >>> return -ENOENT; >>> >>> out_cache_jobid: >>> @@ -663,10 +812,13 @@ static int __init obdclass_init(void) >>> if (err) >>> goto cleanup_zombie_impexp; >>> >>> + err = rhashtable_init(&session_jobids, &jobid_params); >>> + if (err) >>> + goto cleanup_class_handle; >>> err = misc_register(&obd_psdev); >>> if (err) { >>> CERROR("cannot register OBD miscdevices: err %d\n", err); >>> - goto cleanup_class_handle; >>> + goto cleanup_session_jobids; >>> } >>> >>> /* Default the dirty page cache cap to 1/2 of system memory. 
I think what Neil proposes is that the JobID be attached to the session, and
then used by Lustre in the kernel from the session.  Not quite the same as a
per-process JobID, but in most cases that would be enough (if a process wants
its own JobID it can always start a new session itself).

It would allow multiple JobIDs for a single node, and with the "complex jobid"
support it would still be possible to specify a per-session JobID with e.g.
"%p", "%e", "%h", "%u" in the global "jobid_name", and use the per-session
"jobid_this_session" for the "%j" value.

The main difference is that the job scheduler preamble script would have to
call "lctl set_param jobid_this_session=NNNN" before launching the user job,
rather than not having to do anything beyond the existing "setenv".

Cheers, Andreas

> On Mar 14, 2019, at 08:12, Patrick Farrell <pfarrell@whamcloud.com> wrote:
>
> So can we set session ids?  Can we make them the same across many nodes?
> It doesn't sound like it.  That's a basic requirement.  We need more than
> just a "kernel-provided ID which is unique to a process set on a node".
> The point of jobstats is to be able to collect info on all of a job, and
> many jobs will span many nodes.  (We also have > 1 job per node in other
> cases.)
>
> - Patrick
>> From: NeilBrown <neilb@suse.com>
>> Sent: Wednesday, March 13, 2019 7:34:05 PM
>> To: Patrick Farrell; Andreas Dilger
>> Cc: Lustre Development List
>> Subject: Re: [PATCH 37/37] lustre: obd_sysfs: error-check value stored in jobid_var
>>
>>
>> Hi Patrick,
>> I agree that "something that should live in struct_task" is the sort of
>> answer that would be nice, but I don't think it can be justified.
>>
>> A "session" in the sense of "getsid" and "setsid" is traditionally a
>> login session, but since the invention of windowing systems it is one
>> terminal window.
>> So if you
>>    grep NSsid /proc/self/status
>> in different windows you will see different session ids.  All processes
>> in that window will have the same session id unless something
>> explicitly creates a new session - few things do that.
>>
>> If you want to explicitly create a new session, you can use "setsid".
>> So
>>    setsid grep NSsid /proc/self/status
>> will always show a new session id.
>>
>> Considering your sketch, the "unique parent" is already very likely to
>> be in a session that no other job is in.
>> If you run multiple jobs from the one terminal window (is that likely?)
>> then they will all be in the same session unless you run
>>    setsid run-my-job
>>
>> The session (and anything we associate with it) will apply to all
>> children and grandchildren etc.
>> There is no need for a timeout beyond what the code already has - it
>> checks every 5 minutes to see if the session is still active, and
>> forgets any inactive session.  There is no risk of a new session being
>> confused with an old session that no longer exists, even if they have
>> the same number.
>>
>> I hope that clarifies for you.
>>
> Thanks,
> NeilBrown
>
>
> On Fri, Mar 01 2019, Patrick Farrell wrote:
>
> > The definition of "session" matters here.  It is increasingly common to
> > run multiple jobs on a node at the same time, and the desire to
> > distinguish between them is also significant.  So we can't have just
> > one session.
> >
> > I believe this is part of why it was chosen to put this information in
> > user space, associated with a particular process.
> >
> > So we need a solution that can meet that requirement or the broader
> > HPC world will just ignore us.
> >
> > Perhaps we could use the new jobid cache functionality here.  A sketch:
> >
> > Have the parent - I believe there would be a unique parent for every
> > set of job processes - register the ID (via this proc variable) and we
> > would apply it to all children (what about grandchildren, etc...).
> > Then set timeouts to something like 24 or 48 hours, but also use (or
> > add if it's not present?) the functionality to deregister job ids at
> > job exit.  So the timeout is only for cleanup.
> >
> > Although, the more I talk about this, the more this feels like
> > something that should live in struct_task and be set from user space
> > and managed by the kernel...
> > ________________________________
> > From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Andreas Dilger <adilger@whamcloud.com>
> > Sent: Friday, March 1, 2019 2:32:00 AM
> > To: NeilBrown
> > Cc: Lustre Development List
> > Subject: Re: [lustre-devel] [PATCH 37/37] lustre: obd_sysfs: error-check value stored in jobid_var
> >
> > On Feb 28, 2019, at 19:35, NeilBrown <neilb@suse.com> wrote:
> >>
> >> On Wed, Feb 27 2019, Andreas Dilger wrote:
> >>
> >>> On Feb 18, 2019, at 17:09, NeilBrown <neilb@suse.com> wrote:
> >>>>
> >>>> The jobid_var sysfs attribute only has 3 meaningful values.
> >>>> Other values cause lustre_get_jobid() to return an error
> >>>> which is uniformly ignored.
> >>>>
> >>>> To improve usability and resilience, check that the value
> >>>> written is acceptable before storing it.
> >>>>
> >>>> Signed-off-by: NeilBrown <neilb@suse.com>
> >>>
> >>> This will no longer be true once https://review.whamcloud.com/31691
> >>> commit 6488c0ec57de ("LU-10698 obdclass: allow specifying complex jobids")
> >>
> >> Actually it will.  That patch changes the use of jobid_name, my patch
> >> restricts the values of jobid_var.
> >>
> >> I just realized why it is called "jobid_var" - in OpenSFS lustre, it can
> >> be an environment variable name.  In drivers/staging lustre it cannot,
> >> so the name is a little odd.
> >>
> >>>
> >>> Currently the "%j" function was removed from the kernel client, even
> >>> though there is no technical reason it can't work (i.e. all of the code
> >>> to implement it is available and exported).  This is actually super
> >>> useful for HPC cluster administrators to monitor per-job IO bandwidth
> >>> and IOPS on the server, and something that I think should be restored.
> >>
> >> I think that you probably need to let go of that desire - I don't think
> >> it is going to happen.  While the code may, as you say, work - it is
> >> easy to dislike that approach, and it would be hard to push against such
> >> resistance.
> >>
> >> I have an alternate approach, patch below.  Instead of
> >>    export LUSTRE_JOBID=foobar
> >> a process can run
> >>    echo foobar > /sys/fs/lustre/jobid_this_session
> >>
> >> and it will affect all processes in the current "session".
> >>
> >> Could you warm to this approach at all?
> >
> > This is the best alternative that I've seen so far.  That said, the choice
> > of storing the JobID as an environment variable wasn't something that we
> > invented ourselves, but rather this is what the various job schedulers do
> > when a job is first started, and the code to extract the JobID out of the
> > process environment was what we implemented to work within the constraints
> > that were given to us.  The benefit of course is that we don't depend on
> > some external process to be run for every process to extract the environment
> > variable from the kernel memory and send it back to us.
> >
> > Allowing different processes to have different JobIDs on a single node is
> > definitely a requirement for most environments these days, and so far this
> > is the only solution that has allowed that to work.
> >
> > James is more familiar with the production side of the house, so I'll let
> > him comment on how easy/hard it would be to get this kind of change added
> > to the per-job preamble script so that it is set for all processes.
> >
> > Cheers, Andreas
> >
> >> From: NeilBrown <neilb@suse.com>
> >> Subject: [PATCH] lustre: obdclass: allow per-session jobids.
> >>
> >> Lustre includes a jobid in all RPC messages sent to the server.  This
> >> is used to collect per-job statistics, where a "job" can involve
> >> multiple processes on multiple nodes in a cluster.
> >>
> >> Nodes in a cluster can be running processes for multiple jobs, so it
> >> is best if different processes can have different jobids, and that
> >> processes on different nodes can have the same job id.
> >>
> >> This is not currently possible with the drivers/staging code.
> >>
> >> Lustre traditionally uses an environment variable to name a job, but
> >> having the kernel reach into the address space of a process to find
> >> that environment variable is seen by some developers to be an
> >> unacceptable design choice.
> >>
> >> This patch provides an alternate method, leveraging the concept of a
> >> "session id", as set with setsid().  Each login session already gets a
> >> unique sid which is preserved for all processes in that session unless
> >> explicitly changed (with setsid(1)).
> >> When a process in a session writes to
> >> /sys/fs/lustre/jobid_this_session, the string becomes the name for
> >> that session.
> >> If jobid_var is set to "manual", then the per-session jobid is used
> >> for the jobid for all requests from processes in that session.
> >>
> >> When a session ends, the jobid information will be purged within 5
> >> minutes.
> >>
> >> Signed-off-by: NeilBrown <neilb@suse.com>
> >> ---
> >>  .../staging/lustre/lustre/include/lprocfs_status.h |    1 +
> >>  drivers/staging/lustre/lustre/include/obd_class.h  |    3 +
> >>  drivers/staging/lustre/lustre/obdclass/class_obd.c |  160 ++++++++++++++++++++-
> >>  drivers/staging/lustre/lustre/obdclass/obd_sysfs.c |   41 ++++++
> >>  4 files changed, 204 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/staging/lustre/lustre/include/lprocfs_status.h b/drivers/staging/lustre/lustre/include/lprocfs_status.h
> >> index 8565c28f08ee..1335a5722903 100644
> >> --- a/drivers/staging/lustre/lustre/include/lprocfs_status.h
> >> +++ b/drivers/staging/lustre/lustre/include/lprocfs_status.h
> >> @@ -370,6 +370,7 @@ static inline void s2dhms(struct dhms *ts, time64_t secs64)
> >>  #define JOBSTATS_DISABLE        "disable"
> >>  #define JOBSTATS_PROCNAME_UID   "procname_uid"
> >>  #define JOBSTATS_NODELOCAL      "nodelocal"
> >> +#define JOBSTATS_MANUAL         "manual"
> >>
> >>  /* obd_config.c */
> >>  void lustre_register_client_process_config(int (*cpc)(struct lustre_cfg *lcfg));
> >> diff --git a/drivers/staging/lustre/lustre/include/obd_class.h b/drivers/staging/lustre/lustre/include/obd_class.h
> >> index 50b08c89ecc5..08003f3dd467 100644
> >> --- a/drivers/staging/lustre/lustre/include/obd_class.h
> >> +++ b/drivers/staging/lustre/lustre/include/obd_class.h
> >> @@ -55,6 +55,9 @@ extern rwlock_t obd_dev_lock;
> >>  struct obd_device *class_exp2obd(struct obd_export *exp);
> >>  int class_handle_ioctl(unsigned int cmd, unsigned long arg);
> >>  int lustre_get_jobid(char *jobid);
> >> +char *jobid_current(void);
> >> +int jobid_set_current(char *jobid);
> >> +
> >>
> >>  struct lu_device_type;
> >>
> >> diff --git a/drivers/staging/lustre/lustre/obdclass/class_obd.c b/drivers/staging/lustre/lustre/obdclass/class_obd.c
> >> index 1fcbda128a58..19ce3c858e59 100644
> >> --- a/drivers/staging/lustre/lustre/obdclass/class_obd.c
> >> +++ b/drivers/staging/lustre/lustre/obdclass/class_obd.c
> >> @@ -79,6 +79,144 @@ EXPORT_SYMBOL(at_extra);
> >>  char obd_jobid_var[JOBSTATS_JOBID_VAR_MAX_LEN + 1] = JOBSTATS_DISABLE;
> >>  char obd_jobid_node[LUSTRE_JOBID_SIZE + 1];
> >>
> >> +/*
> >> + * Jobid can be set for a session (see setsid(2)) by writing to
> >> + * a sysfs file from any process in that session.
> >> + * The jobids are stored in a hash table indexed by the relevant
> >> + * struct pid.  We periodically look for entries where the pid has
> >> + * no PIDTYPE_SID tasks any more, and prune them.  This happens within
> >> + * 5 seconds of a jobid being added, and every 5 minutes when jobids exist,
> >> + * but none are added.
> >> + */
> >> +#define JOBID_EXPEDITED_CLEAN (5 * HZ)
> >> +#define JOBID_BACKGROUND_CLEAN (5 * 60 * HZ)
> >> +
> >> +struct session_jobid {
> >> +        struct pid        *session;
> >> +        struct rhash_head linkage;
> >> +        struct rcu_head   rcu;
> >> +        char              jobid[1];
> >> +};
> >> +
> >> +const static struct rhashtable_params jobid_params = {
> >> +        .key_len     = sizeof(struct pid *),
> >> +        .key_offset  = offsetof(struct session_jobid, session),
> >> +        .head_offset = offsetof(struct session_jobid, linkage),
> >> +};
> >> +static struct rhashtable session_jobids;
> >> +
> >> +/*
> >> + * jobid_current must be called with rcu_read_lock held.
> >> + * if it returns non-NULL, the string can only be used
> >> + * until rcu_read_unlock is called.
> >> + */
> >> +char *jobid_current(void)
> >> +{
> >> +        struct pid *sid = current->signal->pids[PIDTYPE_SID];
> >> +        struct session_jobid *sj;
> >> +
> >> +        sj = rhashtable_lookup_fast(&session_jobids, &sid, jobid_params);
> >> +        if (sj)
> >> +                return sj->jobid;
> >> +        return NULL;
> >> +}
> >> +
> >> +static void jobid_prune_expedite(void);
> >> +/*
> >> + * jobid_set_current will try to add a new entry
> >> + * to the table.  If one exists with the same key, the
> >> + * jobid will be replaced
> >> + */
> >> +int jobid_set_current(char *jobid)
> >> +{
> >> +        struct pid *sid = current->signal->pids[PIDTYPE_SID];
> >> +        struct session_jobid *sj, *origsj;
> >> +        int ret;
> >> +
> >> +        sj = kmalloc(sizeof(*sj) + strlen(jobid), GFP_KERNEL);
> >> +        if (!sj)
> >> +                return -ENOMEM;
> >> +        rcu_read_lock();
> >> +        sj->session = get_pid(sid);
> >> +        strcpy(sj->jobid, jobid);
> >> +        origsj = rhashtable_lookup_get_insert_fast(&session_jobids,
> >> +                                                   &sj->linkage,
> >> +                                                   jobid_params);
> >> +        if (origsj == NULL) {
> >> +                /* successful insert */
> >> +                rcu_read_unlock();
> >> +                jobid_prune_expedite();
> >> +                return 0;
> >> +        }
> >> +
> >> +        if (IS_ERR(origsj)) {
> >> +                put_pid(sj->session);
> >> +                kfree(sj);
> >> +                rcu_read_unlock();
> >> +                return PTR_ERR(origsj);
> >> +        }
> >> +        ret = rhashtable_replace_fast(&session_jobids,
> >> +                                      &origsj->linkage,
> >> +                                      &sj->linkage,
> >> +                                      jobid_params);
> >> +        if (ret) {
> >> +                put_pid(sj->session);
> >> +                kfree(sj);
> >> +                rcu_read_unlock();
> >> +                return ret;
> >> +        }
> >> +        put_pid(origsj->session);
> >> +        rcu_read_unlock();
> >> +        kfree_rcu(origsj, rcu);
> >> +        jobid_prune_expedite();
> >> +
> >> +        return 0;
> >> +}
> >> +
> >> +static void jobid_free(void *vsj, void *arg)
> >> +{
> >> +        struct session_jobid *sj = vsj;
> >> +
> >> +        put_pid(sj->session);
> >> +        kfree(sj);
> >> +}
> >> +
> >> +static void jobid_prune(struct work_struct *work);
> >> +static DECLARE_DELAYED_WORK(jobid_prune_work, jobid_prune);
> >> +static int jobid_prune_expedited;
> >> +static void jobid_prune(struct work_struct *work)
> >> +{
> >> +        int remaining = 0;
> >> +        struct rhashtable_iter iter;
> >> +        struct session_jobid *sj;
> >> +
> >> +        jobid_prune_expedited = 0;
> >> +        rhashtable_walk_enter(&session_jobids, &iter);
> >> +        rhashtable_walk_start(&iter);
> >> +        while ((sj = rhashtable_walk_next(&iter)) != NULL) {
> >> +                if (!hlist_empty(&sj->session->tasks[PIDTYPE_SID])) {
> >> +                        remaining++;
> >> +                        continue;
> >> +                }
> >> +                if (rhashtable_remove_fast(&session_jobids,
> >> +                                           &sj->linkage, jobid_params) == 0) {
> >> +                        put_pid(sj->session);
> >> +                        kfree_rcu(sj, rcu);
> >> +                }
> >> +        }
> >> +        rhashtable_walk_stop(&iter);
> >> +        rhashtable_walk_exit(&iter);
> >> +        if (remaining)
> >> +                schedule_delayed_work(&jobid_prune_work, JOBID_BACKGROUND_CLEAN);
> >> +}
> >> +
> >> +static void jobid_prune_expedite(void)
> >> +{
> >> +        if (!jobid_prune_expedited) {
> >> +                jobid_prune_expedited = 1;
> >> +                mod_delayed_work(system_wq, &jobid_prune_work, JOBID_EXPEDITED_CLEAN);
> >> +        }
> >> +}
> >> +
> >>  /* Get jobid of current process from stored variable or calculate
> >>   * it from pid and user_id.
> >>   *
> >> @@ -108,6 +246,17 @@ int lustre_get_jobid(char *jobid)
> >>                  goto out_cache_jobid;
> >>          }
> >>
> >> +        if (strcmp(obd_jobid_var, JOBSTATS_MANUAL) == 0) {
> >> +                char *jid;
> >> +                rcu_read_lock();
> >> +                jid = jobid_current();
> >> +                if (jid)
> >> +                        strlcpy(tmp_jobid, jid, sizeof(tmp_jobid));
> >> +                rcu_read_unlock();
> >> +                if (jid)
> >> +                        goto out_cache_jobid;
> >> +        }
> >> +
> >>          return -ENOENT;
> >>
> >>  out_cache_jobid:
> >> @@ -663,10 +812,13 @@ static int __init obdclass_init(void)
> >>          if (err)
> >>                  goto cleanup_zombie_impexp;
> >>
> >> +        err = rhashtable_init(&session_jobids, &jobid_params);
> >> +        if (err)
> >> +                goto cleanup_class_handle;
> >>          err = misc_register(&obd_psdev);
> >>          if (err) {
> >>                  CERROR("cannot register OBD miscdevices: err %d\n", err);
> >> -                goto cleanup_class_handle;
> >> +                goto cleanup_session_jobids;
> >>          }
> >>
> >>          /* Default the dirty page cache cap to 1/2 of system memory.
> >> @@ -724,6 +876,9 @@ static int __init obdclass_init(void)
> >>  cleanup_deregister:
> >>          misc_deregister(&obd_psdev);
> >>
> >> +cleanup_session_jobids:
> >> +        rhashtable_free_and_destroy(&session_jobids, jobid_free, NULL);
> >> +
> >>  cleanup_class_handle:
> >>          class_handle_cleanup();
> >>
> >> @@ -743,6 +898,9 @@ static void obdclass_exit(void)
> >>          cl_global_fini();
> >>          lu_global_fini();
> >>
> >> +        cancel_delayed_work_sync(&jobid_prune_work);
> >> +        rhashtable_free_and_destroy(&session_jobids, jobid_free, NULL);
> >> +
> >>          obd_cleanup_caches();
> >>
> >>          class_procfs_clean();
> >> diff --git a/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c b/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
> >> index 69ccc6a55947..112782e56793 100644
> >> --- a/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
> >> +++ b/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
> >> @@ -220,6 +220,7 @@ static ssize_t jobid_var_store(struct kobject *kobj, struct attribute *attr,
> >>                  JOBSTATS_DISABLE,
> >>                  JOBSTATS_PROCNAME_UID,
> >>                  JOBSTATS_NODELOCAL,
> >> +                JOBSTATS_MANUAL,
> >>                  NULL
> >>          };
> >>          int i;
> >> @@ -263,6 +264,44 @@ static ssize_t jobid_name_store(struct kobject *kobj, struct attribute *attr,
> >>          return count;
> >>  }
> >>
> >> +static ssize_t jobid_this_session_show(struct kobject *kobj,
> >> +                                       struct attribute *attr,
> >> +                                       char *buf)
> >> +{
> >> +        char *jid;
> >> +        int ret = -ENOENT;
> >> +
> >> +        rcu_read_lock();
> >> +        jid = jobid_current();
> >> +        if (jid)
> >> +                ret = snprintf(buf, PAGE_SIZE, "%s\n", jid);
> >> +        rcu_read_unlock();
> >> +        return ret;
> >> +}
> >> +
> >> +static ssize_t jobid_this_session_store(struct kobject *kobj,
> >> +                                        struct attribute *attr,
> >> +                                        const char *buffer,
> >> +                                        size_t count)
> >> +{
> >> +        char *jobid;
> >> +        int len;
> >> +        int ret;
> >> +
> >> +        if (!count || count > LUSTRE_JOBID_SIZE)
> >> +                return -EINVAL;
> >> +
> >> +        jobid = kstrndup(buffer, count, GFP_KERNEL);
> >> +        if (!jobid)
> >> +                return -ENOMEM;
> >> +        len = strcspn(jobid, " \n");
> >> +        jobid[len] = '\0';
> >> +        ret = jobid_set_current(jobid);
> >> +        kfree(jobid);
> >> +
> >> +        return ret ?: count;
> >> +}
> >> +
> >>  /* Root for /sys/kernel/debug/lustre */
> >>  struct dentry *debugfs_lustre_root;
> >>  EXPORT_SYMBOL_GPL(debugfs_lustre_root);
> >> @@ -272,6 +311,7 @@ LUSTRE_RO_ATTR(pinger);
> >>  LUSTRE_RO_ATTR(health_check);
> >>  LUSTRE_RW_ATTR(jobid_var);
> >>  LUSTRE_RW_ATTR(jobid_name);
> >> +LUSTRE_RW_ATTR(jobid_this_session);
> >>
> >>  static struct attribute *lustre_attrs[] = {
> >>          &lustre_attr_version.attr,
> >> @@ -279,6 +319,7 @@ static struct attribute *lustre_attrs[] = {
> >>          &lustre_attr_health_check.attr,
> >>          &lustre_attr_jobid_name.attr,
> >>          &lustre_attr_jobid_var.attr,
> >> +        &lustre_attr_jobid_this_session.attr,
> >>          &lustre_sattr_timeout.u.attr,
> >>          &lustre_attr_max_dirty_mb.attr,
> >>          &lustre_sattr_debug_peer_on_timeout.u.attr,
> >> --
> >> 2.14.0.rc0.dirty
> >>
> >
> > Cheers, Andreas
> > ---
> > Andreas Dilger
> > CTO Whamcloud
> >
> >
> >
> >
> > _______________________________________________
> > lustre-devel mailing list
> > lustre-devel@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

Cheers, Andreas
---
Andreas Dilger
CTO Whamcloud
diff --git a/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c b/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
index 57a6f2c2da1c..69ccc6a55947 100644
--- a/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
+++ b/drivers/staging/lustre/lustre/obdclass/obd_sysfs.c
@@ -216,16 +216,25 @@ static ssize_t jobid_var_store(struct kobject *kobj, struct attribute *attr,
 			       const char *buffer, size_t count)
 {
+	static char *valid[] = {
+		JOBSTATS_DISABLE,
+		JOBSTATS_PROCNAME_UID,
+		JOBSTATS_NODELOCAL,
+		NULL
+	};
+	int i;
+
 	if (!count || count > JOBSTATS_JOBID_VAR_MAX_LEN)
 		return -EINVAL;
 
-	memset(obd_jobid_var, 0, JOBSTATS_JOBID_VAR_MAX_LEN + 1);
-
-	memcpy(obd_jobid_var, buffer, count);
+	for (i = 0; valid[i]; i++)
+		if (sysfs_streq(buffer, valid[i]))
+			break;
+	if (!valid[i])
+		return -EINVAL;
 
-	/* Trim the trailing '\n' if any */
-	if (obd_jobid_var[count - 1] == '\n')
-		obd_jobid_var[count - 1] = 0;
+	memset(obd_jobid_var, 0, JOBSTATS_JOBID_VAR_MAX_LEN + 1);
+	strcpy(obd_jobid_var, valid[i]);
 
 	return count;
 }
The jobid_var sysfs attribute only has 3 meaningful values.
Other values cause lustre_get_jobid() to return an error
which is uniformly ignored.

To improve usability and resilience, check that the value
written is acceptable before storing it.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/obdclass/obd_sysfs.c |   21 ++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)