Message ID | 20180323191614.32489-11-ebiederm@xmission.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, 23 Mar 2018, Eric W. Biederman wrote:

>Today the last process to update a semaphore is remembered and
>reported in the pid namespace of that process. If there are processes
>in any other pid namespace querying that process id with GETPID the
>result will be unusable nonsense as it does not make any
>sense in your own pid namespace.

Yeah, that sounds pretty wrong.

>
>Due to ipc_update_pid I don't think you will be able to get System V
>ipc semaphores into a troublesome cache line ping-pong. Using struct
>pids from separate processes is not a problem because they do not share
>a cache line. Using struct pid from different threads of the same
>process is unlikely to be a problem as the reference count update
>can be avoided.
>
>Further, Linux futexes are a much better tool for the job of mutual
>exclusion between processes than System V semaphores. So I expect
>programs that are performance limited by their interprocess mutual
>exclusion primitive will be using futexes.

You would be wrong. There are plenty of real workloads out there
that do not use futexes and care about performance; in the end
futexes are only good for the uncontended case, and they can also
destroy NUMA boxes if you consider the global hash table. Experience
has shown me that sysvipc sems are still quite used.

>
>So while it is possible that enhancing the storage of the last
>process of a System V semaphore from an integer to a struct pid
>will cause a performance regression because of the effect
>of frequently updating the pid reference count, I don't expect
>that to happen in practice.

How's that? Now, thanks to ipc_update_pid(), for each semop the user
passes, perform_atomic_semop() will do two atomic updates in the cases
where multiple processes update the sem. This is not uncommon.

Could you please provide some numbers.
Thanks,
Davidlohr
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 28 Mar 2018, Davidlohr Bueso wrote:

>On Fri, 23 Mar 2018, Eric W. Biederman wrote:
>
>>Today the last process to update a semaphore is remembered and
>>reported in the pid namespace of that process. If there are processes
>>in any other pid namespace querying that process id with GETPID the
>>result will be unusable nonsense as it does not make any
>>sense in your own pid namespace.
>
>Yeah, that sounds pretty wrong.
>
>>
>>Due to ipc_update_pid I don't think you will be able to get System V
>>ipc semaphores into a troublesome cache line ping-pong. Using struct
>>pids from separate processes is not a problem because they do not share
>>a cache line. Using struct pid from different threads of the same
>>process is unlikely to be a problem as the reference count update
>>can be avoided.
>>
>>Further, Linux futexes are a much better tool for the job of mutual
>>exclusion between processes than System V semaphores. So I expect
>>programs that are performance limited by their interprocess mutual
>>exclusion primitive will be using futexes.
>
>You would be wrong. There are plenty of real workloads out there
>that do not use futexes and care about performance; in the end
>futexes are only good for the uncontended case, and they can also
>destroy NUMA boxes if you consider the global hash table. Experience
>has shown me that sysvipc sems are still quite used.
>
>>
>>So while it is possible that enhancing the storage of the last
>>process of a System V semaphore from an integer to a struct pid
>>will cause a performance regression because of the effect
>>of frequently updating the pid reference count, I don't expect
>>that to happen in practice.
>
>How's that? Now, thanks to ipc_update_pid(), for each semop the user
>passes, perform_atomic_semop() will do two atomic updates in the cases
>where multiple processes update the sem. This is not uncommon.
>
>Could you please provide some numbers.

I ran this on a 40-core (no ht) Westmere with two benchmarks.
The first is Manfred's sysvsem lockunlock[1] program which uses
_processes_ to, well, lock and unlock the semaphore. The options are a
little unconventional: to keep the critical region small and the
lock+unlock frequency high I added busy_in=busy_out=10. Similarly, to
get the worst case scenario and have everyone update the same
semaphore, a single one is used. Here are the results (pretty low
stddev from run to run) for doing 100,000 lock+unlock.

- 1 proc:
  * vanilla
	total execution time: 0.110638 seconds for 100000 loops
  * dirty
	total execution time: 0.120144 seconds for 100000 loops

- 2 proc:
  * vanilla
	total execution time: 0.379756 seconds for 100000 loops
  * dirty
	total execution time: 0.477778 seconds for 100000 loops

- 4 proc:
  * vanilla
	total execution time: 6.749710 seconds for 100000 loops
  * dirty
	total execution time: 4.651872 seconds for 100000 loops

- 8 proc:
  * vanilla
	total execution time: 5.558404 seconds for 100000 loops
  * dirty
	total execution time: 7.143329 seconds for 100000 loops

- 16 proc:
  * vanilla
	total execution time: 9.016398 seconds for 100000 loops
  * dirty
	total execution time: 9.412055 seconds for 100000 loops

- 32 proc:
  * vanilla
	total execution time: 9.694451 seconds for 100000 loops
  * dirty
	total execution time: 9.990451 seconds for 100000 loops

- 64 proc:
  * vanilla
	total execution time: 9.844984 seconds for 100032 loops
  * dirty
	total execution time: 10.016464 seconds for 100032 loops

Lower task counts show pretty massive performance hits of ~9%, ~25%
and ~30% for single, two and four/eight processes. As more are added I
guess the overhead tends to disappear, as for one you have a lot more
locking contention going on.

The second workload I ran this patch on was Chris Mason's
sem-scalebench[2] program, which uses _threads_ for the sysvsem option
(this benchmark is more about semaphores as a concept than sysvsem
specifically).
Dealing with a single semaphore and increasing thread counts we get:

sembench-sem                vanilla                dirty
Hmean sembench-sem-2      286272.00 (  0.00%)   288232.00 (  0.68%)
Hmean sembench-sem-8      510966.00 (  0.00%)   494375.00 ( -3.25%)
Hmean sembench-sem-12     435753.00 (  0.00%)   465328.00 (  6.79%)
Hmean sembench-sem-21     448144.00 (  0.00%)   462091.00 (  3.11%)
Hmean sembench-sem-30     479519.00 (  0.00%)   471295.00 ( -1.72%)
Hmean sembench-sem-48     533270.00 (  0.00%)   542525.00 (  1.74%)
Hmean sembench-sem-79     510218.00 (  0.00%)   528392.00 (  3.56%)

Unsurprisingly, the thread case shows no overhead (and yes, it is even
better at times, but that is still noise). Similarly, when completely
abusing the system and doing 64*NCPUS there is pretty much no
difference:

             vanilla      dirty
User         1865.99    1819.75
System      35080.97   35396.34
Elapsed      3602.03    3560.50

So at least for a large box this patch hurts the cases where there is
low to medium cpu usage (no more than ~8 processes on a 40 core box)
in a non-trivial way. For more processes it doesn't matter. We can
confirm that the case for threads is irrelevant. While I'm not happy
about the 30% regression I guess we can live with this.

Manfred, any thoughts?

Thanks,
Davidlohr

[1] https://github.com/manfred-colorfu/ipcscale/blob/master/sem-lockunlock.c
[2] https://github.com/davidlohr/sembench-ng/blob/master/sembench.c
Davidlohr Bueso <dave@stgolabs.net> writes:

> I ran this on a 40-core (no ht) Westmere with two benchmarks. The first
> is Manfred's sysvsem lockunlock[1] program which uses _processes_ to,
> well, lock and unlock the semaphore. The options are a little
> unconventional: to keep the critical region small and the lock+unlock
> frequency high I added busy_in=busy_out=10. Similarly, to get the
> worst case scenario and have everyone update the same semaphore, a single
> one is used. Here are the results (pretty low stddev from run to run)
> for doing 100,000 lock+unlock.
>
> - 1 proc:
>   * vanilla
> 	total execution time: 0.110638 seconds for 100000 loops
>   * dirty
> 	total execution time: 0.120144 seconds for 100000 loops
>
> - 2 proc:
>   * vanilla
> 	total execution time: 0.379756 seconds for 100000 loops
>   * dirty
> 	total execution time: 0.477778 seconds for 100000 loops
>
> - 4 proc:
>   * vanilla
> 	total execution time: 6.749710 seconds for 100000 loops
>   * dirty
> 	total execution time: 4.651872 seconds for 100000 loops
>
> - 8 proc:
>   * vanilla
> 	total execution time: 5.558404 seconds for 100000 loops
>   * dirty
> 	total execution time: 7.143329 seconds for 100000 loops
>
> - 16 proc:
>   * vanilla
> 	total execution time: 9.016398 seconds for 100000 loops
>   * dirty
> 	total execution time: 9.412055 seconds for 100000 loops
>
> - 32 proc:
>   * vanilla
> 	total execution time: 9.694451 seconds for 100000 loops
>   * dirty
> 	total execution time: 9.990451 seconds for 100000 loops
>
> - 64 proc:
>   * vanilla
> 	total execution time: 9.844984 seconds for 100032 loops
>   * dirty
> 	total execution time: 10.016464 seconds for 100032 loops
>
> Lower task counts show pretty massive performance hits of ~9%, ~25%
> and ~30% for single, two and four/eight processes. As more are added
> I guess the overhead tends to disappear, as for one you have a lot
> more locking contention going on.

Can you check your notes on the 4 process case? As I read the 4 process
case above it is ~30% improvement.
Either that is a typo or there is the potential for quite a bit of
noise in the test case.

Eric
On Fri, 30 Mar 2018, Eric W. Biederman wrote:

>Davidlohr Bueso <dave@stgolabs.net> writes:
>
>> I ran this on a 40-core (no ht) Westmere with two benchmarks. The first
>> is Manfred's sysvsem lockunlock[1] program which uses _processes_ to,
>> well, lock and unlock the semaphore. The options are a little
>> unconventional: to keep the critical region small and the lock+unlock
>> frequency high I added busy_in=busy_out=10. Similarly, to get the
>> worst case scenario and have everyone update the same semaphore, a single
>> one is used. Here are the results (pretty low stddev from run to run)
>> for doing 100,000 lock+unlock.
>>
>> - 1 proc:
>>   * vanilla
>> 	total execution time: 0.110638 seconds for 100000 loops
>>   * dirty
>> 	total execution time: 0.120144 seconds for 100000 loops
>>
>> - 2 proc:
>>   * vanilla
>> 	total execution time: 0.379756 seconds for 100000 loops
>>   * dirty
>> 	total execution time: 0.477778 seconds for 100000 loops
>>
>> - 4 proc:
>>   * vanilla
>> 	total execution time: 6.749710 seconds for 100000 loops
>>   * dirty
>> 	total execution time: 4.651872 seconds for 100000 loops
>>
>> - 8 proc:
>>   * vanilla
>> 	total execution time: 5.558404 seconds for 100000 loops
>>   * dirty
>> 	total execution time: 7.143329 seconds for 100000 loops
>>
>> - 16 proc:
>>   * vanilla
>> 	total execution time: 9.016398 seconds for 100000 loops
>>   * dirty
>> 	total execution time: 9.412055 seconds for 100000 loops
>>
>> - 32 proc:
>>   * vanilla
>> 	total execution time: 9.694451 seconds for 100000 loops
>>   * dirty
>> 	total execution time: 9.990451 seconds for 100000 loops
>>
>> - 64 proc:
>>   * vanilla
>> 	total execution time: 9.844984 seconds for 100032 loops
>>   * dirty
>> 	total execution time: 10.016464 seconds for 100032 loops
>>
>> Lower task counts show pretty massive performance hits of ~9%, ~25%
>> and ~30% for single, two and four/eight processes. As more are added
>> I guess the overhead tends to disappear, as for one you have a lot
>> more locking contention going on.
>
>Can you check your notes on the 4 process case? As I read the 4 process
>case above it is ~30% improvement. Either that is a typo or there is the
>potential for quite a bit of noise in the test case.

Yeah, sorry, that was a typo. Unlike the second benchmark I didn't have
this one automated, but it's always the vanilla kernel that outperforms
the dirty one.

Thanks,
Davidlohr
Hi,

On 03/30/2018 09:09 PM, Davidlohr Bueso wrote:
> On Wed, 28 Mar 2018, Davidlohr Bueso wrote:
>
>> On Fri, 23 Mar 2018, Eric W. Biederman wrote:
>>
>>> Today the last process to update a semaphore is remembered and
>>> reported in the pid namespace of that process. If there are processes
>>> in any other pid namespace querying that process id with GETPID the
>>> result will be unusable nonsense as it does not make any
>>> sense in your own pid namespace.
>>
>> Yeah, that sounds pretty wrong.
>>
>>>
>>> Due to ipc_update_pid I don't think you will be able to get System V
>>> ipc semaphores into a troublesome cache line ping-pong. Using struct
>>> pids from separate processes is not a problem because they do not share
>>> a cache line. Using struct pid from different threads of the same
>>> process is unlikely to be a problem as the reference count update
>>> can be avoided.
>>>
>>> Further, Linux futexes are a much better tool for the job of mutual
>>> exclusion between processes than System V semaphores. So I expect
>>> programs that are performance limited by their interprocess mutual
>>> exclusion primitive will be using futexes.
>>

The performance of sysv sem and futexes for the contended case is more
or less identical; which one is faster depends on the CONFIG_ options.
And this is obvious: both primitives must do the same tasks.

sleep:
- look up a kernel pointer from a user space reference
- acquire a lock, do some housekeeping, unlock and sleep

wakeup:
- look up a kernel pointer from a user space reference
- acquire a lock, do some housekeeping (especially unlink the
  to-be-woken-up task), unlock and wake up

The woken up task has nothing to do, it returns immediately to user
space.

IIRC for the uncontended case, sysvsem was at ~300 cpu cycles, but that
number is a few years old, and I don't know what the impact of Spectre
is. The futex code is obviously faster.
But I don't know which real-world applications do their own
optimizations for the uncontended case before using sysvsem.

Thus the only "real" challenge is to minimize cache line thrashing.

>> You would be wrong. There are plenty of real workloads out there
>> that do not use futexes and care about performance; in the end
>> futexes are only good for the uncontended case, and they can also
>> destroy NUMA boxes if you consider the global hash table. Experience
>> has shown me that sysvipc sems are still quite used.
>>
>>>
>>> So while it is possible that enhancing the storage of the last
>>> process of a System V semaphore from an integer to a struct pid
>>> will cause a performance regression because of the effect
>>> of frequently updating the pid reference count, I don't expect
>>> that to happen in practice.
>>
>> How's that? Now, thanks to ipc_update_pid(), for each semop the user
>> passes, perform_atomic_semop() will do two atomic updates in the cases
>> where multiple processes update the sem. This is not uncommon.
>>
>> Could you please provide some numbers.
>
[...]
> So at least for a large box this patch hurts the cases where there is low
> to medium cpu usage (no more than ~8 processes on a 40 core box) in a
> non-trivial way. For more processes it doesn't matter. We can confirm
> that the case for threads is irrelevant. While I'm not happy about the
> 30% regression I guess we can live with this.
>
> Manfred, any thoughts?
>
Bugfixing always has first priority, and a 30% regression in one
microbenchmark doesn't seem to be that bad. Thus I would propose that we
fix SEMPID first, and _if_ someone notices a noticeable regression, then
we must improve the code.

--
	Manfred
diff --git a/ipc/sem.c b/ipc/sem.c
index d661c491b0a5..47b263960524 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -98,7 +98,7 @@ struct sem {
 	 * - semctl, via SETVAL and SETALL.
 	 * - at task exit when performing undo adjustments (see exit_sem).
 	 */
-	int	sempid;
+	struct pid *sempid;
 	spinlock_t	lock;	/* spinlock for fine-grained semtimedop */
 	struct list_head pending_alter; /* pending single-sop operations */
 					/* that alter the semaphore */
@@ -128,7 +128,7 @@ struct sem_queue {
 	struct list_head	list;	 /* queue of pending operations */
 	struct task_struct	*sleeper; /* this process */
 	struct sem_undo		*undo;	 /* undo structure */
-	int			pid;	 /* process id of requesting process */
+	struct pid		*pid;	 /* process id of requesting process */
 	int			status;	 /* completion status of operation */
 	struct sembuf		*sops;	 /* array of pending operations */
 	struct sembuf		*blocking; /* the operation that blocked */
@@ -628,7 +628,8 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
  */
 static int perform_atomic_semop_slow(struct sem_array *sma, struct sem_queue *q)
 {
-	int result, sem_op, nsops, pid;
+	int result, sem_op, nsops;
+	struct pid *pid;
 	struct sembuf *sop;
 	struct sem *curr;
 	struct sembuf *sops;
@@ -666,7 +667,7 @@ static int perform_atomic_semop_slow(struct sem_array *sma, struct sem_queue *q)
 		sop--;
 	pid = q->pid;
 	while (sop >= sops) {
-		sma->sems[sop->sem_num].sempid = pid;
+		ipc_update_pid(&sma->sems[sop->sem_num].sempid, pid);
 		sop--;
 	}

@@ -753,7 +754,7 @@ static int perform_atomic_semop(struct sem_array *sma, struct sem_queue *q)
 			un->semadj[sop->sem_num] = undo;
 		}
 		curr->semval += sem_op;
-		curr->sempid = q->pid;
+		ipc_update_pid(&curr->sempid, q->pid);
 	}

 	return 0;
@@ -1160,6 +1161,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 			unlink_queue(sma, q);
 			wake_up_sem_queue_prepare(q, -EIDRM, &wake_q);
 		}
+		ipc_update_pid(&sem->sempid, NULL);
 	}

 	/* Remove the semaphore set from the IDR */
@@ -1352,7 +1354,7 @@ static int semctl_setval(struct ipc_namespace *ns, int semid, int semnum,
 		un->semadj[semnum] = 0;

 	curr->semval = val;
-	curr->sempid = task_tgid_vnr(current);
+	ipc_update_pid(&curr->sempid, task_tgid(current));
 	sma->sem_ctime = ktime_get_real_seconds();
 	/* maybe some queued-up processes were waiting for this */
 	do_smart_update(sma, NULL, 0, 0, &wake_q);
@@ -1473,7 +1475,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,

 		for (i = 0; i < nsems; i++) {
 			sma->sems[i].semval = sem_io[i];
-			sma->sems[i].sempid = task_tgid_vnr(current);
+			ipc_update_pid(&sma->sems[i].sempid, task_tgid(current));
 		}

 		ipc_assert_locked_object(&sma->sem_perm);
@@ -1505,7 +1507,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 		err = curr->semval;
 		goto out_unlock;
 	case GETPID:
-		err = curr->sempid;
+		err = pid_vnr(curr->sempid);
 		goto out_unlock;
 	case GETNCNT:
 		err = count_semcnt(sma, semnum, 0);
@@ -2024,7 +2026,7 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 	queue.sops = sops;
 	queue.nsops = nsops;
 	queue.undo = un;
-	queue.pid = task_tgid_vnr(current);
+	queue.pid = task_tgid(current);
 	queue.alter = alter;
 	queue.dupsop = dupsop;
@@ -2318,7 +2320,7 @@ void exit_sem(struct task_struct *tsk)
 				semaphore->semval = 0;
 			if (semaphore->semval > SEMVMX)
 				semaphore->semval = SEMVMX;
-			semaphore->sempid = task_tgid_vnr(current);
+			ipc_update_pid(&semaphore->sempid, task_tgid(current));
 		}
 	}
 	/* maybe some queued-up processes were waiting for this */
Today the last process to update a semaphore is remembered and
reported in the pid namespace of that process. If there are processes
in any other pid namespace querying that process id with GETPID the
result will be unusable nonsense as it does not make any
sense in your own pid namespace.

Due to ipc_update_pid I don't think you will be able to get System V
ipc semaphores into a troublesome cache line ping-pong. Using struct
pids from separate processes is not a problem because they do not share
a cache line. Using struct pid from different threads of the same
process is unlikely to be a problem as the reference count update
can be avoided.

Further, Linux futexes are a much better tool for the job of mutual
exclusion between processes than System V semaphores. So I expect
programs that are performance limited by their interprocess mutual
exclusion primitive will be using futexes.

So while it is possible that enhancing the storage of the last
process of a System V semaphore from an integer to a struct pid
will cause a performance regression because of the effect
of frequently updating the pid reference count, I don't expect
that to happen in practice.

This change updates semctl(..., GETPID, ...) to return the process id
of the last process to update a semaphore in the pid namespace of the
calling process.

Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 ipc/sem.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)