[PATCH-tip,15/22] locking/rwsem: Merge owner into count on x86-64

Message ID 1549566446-27967-16-git-send-email-longman@redhat.com (mailing list archive)
State New, archived
Series locking/rwsem: Rework rwsem-xadd & enable new rwsem features

Commit Message

Waiman Long Feb. 7, 2019, 7:07 p.m. UTC
With separate count and owner fields, there are timing windows where the
two values are inconsistent. That can cause problems when trying to
figure out the exact state of the rwsem. For instance, an RT task will
stop optimistic spinning if the lock is acquired by a writer but the
owner field isn't set yet. That can be solved by combining the count and
owner together in a single atomic value.

On 32-bit architectures, there aren't enough bits to hold both. 64-bit
architectures, however, have enough bits to do that. On x86-64, the
physical address can use up to 52 bits, i.e. 4PB of memory. That leaves
12 bits available for other use. The task structure pointer is also
aligned to the L1 cache line size, which frees up another 6 bits
(64-byte cacheline). Reserving 2 bits for status flags leaves 16 bits
for the reader count, which supports up to (64k-1) readers.

The owner value will still be duplicated in the owner field for the
purpose of signalling that the task is in the process of acquiring or
releasing a rwsem.

This change is currently for x86-64 only. Other 64-bit architectures may
be enabled in the future if the need arises.
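
For illustration only (this sketch is not part of the posted patch): the
compression boils down to turning the cache-line-aligned owner pointer
into an offset from PAGE_OFFSET and dropping the alignment bits, which is
what rwsem_owner_count()/rwsem_count_owner() in the patch below do. A
minimal userspace sketch, where the constants and the PAGE_OFFSET value
are illustrative stand-ins:

  #include <stdio.h>
  #include <assert.h>

  /* Illustrative stand-ins; the real values come from kernel headers. */
  #define PAGE_OFFSET     0xffff888000000000UL /* example direct-map base */
  #define L1_CACHE_SHIFT  6                    /* 64-byte cachelines */
  #define PA_MASK_SHIFT   52                   /* __PHYSICAL_MASK_SHIFT */
  #define READER_SHIFT    (PA_MASK_SHIFT - L1_CACHE_SHIFT + 2) /* = 48 */
  #define WRITER_MASK     ((1UL << READER_SHIFT) - 4)          /* bits 2-47 */

  /* Compress an owner task pointer into the writer bits of the count. */
  static unsigned long owner_to_count(unsigned long owner)
  {
          return (owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2);
  }

  /* Recover the owner task pointer from the writer bits of the count. */
  static unsigned long count_to_owner(unsigned long count)
  {
          return ((count & WRITER_MASK) << (L1_CACHE_SHIFT - 2)) + PAGE_OFFSET;
  }

  int main(void)
  {
          unsigned long owner = PAGE_OFFSET + 0x12345640UL; /* 64-byte aligned */
          unsigned long count = owner_to_count(owner);

          /* Bits 0-1 stay clear for the flag bits and bits 48-63 stay
           * clear for the reader count; the round trip is lossless.   */
          assert((count & 3UL) == 0 && (count >> READER_SHIFT) == 0);
          assert(count_to_owner(count) == owner);
          printf("owner %#lx compressed to %#lx\n", owner, count);
          return 0;
  }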

With a locking microbenchmark running on a 5.0-based kernel, the total
locking rates (in kops/s) on a 4-socket 56-core x86-64 system before and
after the patch were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        29,085   30,179   27,892   29,514
        2         7,341   14,084    6,240   14,304
        4         7,393   14,246    5,216   11,754
        8         7,139   13,860    5,400   11,308
       16         6,650   15,773    5,744   15,405

This change does have a negative impact on both read and write lock
performance, as the mostly lower post-patch locking rates show.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem-xadd.c |  20 +++++++--
 kernel/locking/rwsem-xadd.h | 105 +++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 110 insertions(+), 15 deletions(-)

Comments

Peter Zijlstra Feb. 7, 2019, 7:45 p.m. UTC | #1
On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
> On 32-bit architectures, there aren't enough bits to hold both.
> 64-bit architectures, however, can have enough bits to do that. For
> x86-64, the physical address can use up to 52 bits. That is 4PB of
> memory. That leaves 12 bits available for other use. The task structure
> pointer is also aligned to the L1 cache size. That means another 6 bits
> (64 bytes cacheline) will be available. Reserving 2 bits for status
> flags, we will have 16 bits for the reader count.  That can supports
> up to (64k-1) readers.

*groan*...

So take qrwlock's idea for a queue, then make the count value (similar
to the new mutex); that is have a bit0 be a r/w bit, when w bits 6-N are
owner, when r they are reader-count. bit1 can be a pending bit, bit2 a
handoff bit etc..

That should fit and work on 32bit and 64bit without issue.

I have a half-arsed rwsem-atomic.c somewhere that does just that. I just
never got around to doing all the optimistic spin and steal crap that
makes our current rwsem fly.

And that nicely gets rid of that mind bending BIAS crud.
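
A minimal compilable sketch of one possible reading of that layout; the
macro names and values here are illustrative only and not taken from any
posted patch:

  #include <stdio.h>

  #define RWSEM_WRITER    (1UL << 0) /* bit 0: r/w bit, set when write-locked */
  #define RWSEM_PENDING   (1UL << 1) /* bit 1: pending bit */
  #define RWSEM_HANDOFF   (1UL << 2) /* bit 2: handoff bit */
  /* Bits 6-N: owner pointer when write-locked, reader count otherwise. */
  #define RWSEM_CNT_SHIFT 6
  #define RWSEM_CNT_MASK  (~0UL << RWSEM_CNT_SHIFT)

  int main(void)
  {
          /* A reader-owned word holding 3 readers and no flag bits set. */
          unsigned long count = 3UL << RWSEM_CNT_SHIFT;

          printf("write-locked=%lu readers=%lu\n", count & RWSEM_WRITER,
                 (count & RWSEM_CNT_MASK) >> RWSEM_CNT_SHIFT);
          return 0;
  }
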
Waiman Long Feb. 7, 2019, 7:55 p.m. UTC | #2
On 02/07/2019 02:45 PM, Peter Zijlstra wrote:
> On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
>> On 32-bit architectures, there aren't enough bits to hold both.
>> 64-bit architectures, however, can have enough bits to do that. For
>> x86-64, the physical address can use up to 52 bits. That is 4PB of
>> memory. That leaves 12 bits available for other use. The task structure
>> pointer is also aligned to the L1 cache size. That means another 6 bits
>> (64 bytes cacheline) will be available. Reserving 2 bits for status
>> flags, we will have 16 bits for the reader count.  That can supports
>> up to (64k-1) readers.
> *groan*...
>
> So take qrwlock's idea for a queue, then make the count value (similar
> to the new mutex); that is have a bit0 be a r/w bit, when w bits 6-N are
> owner, when r they are reader-count. bit1 can be a pending bit, bit2 a
> handoff bit etc..
>
> That should fit and work on 32bit and 64bit without issue.
>
> I have a half-arsed rwsem-atomic.c somewhere that does just that. I just
> never got around to doing all the optimistic spin and steal crap that
> makes our current rwsem fly.
>
> And that nicely gets rid of that mind bending BIAS crud.

Well, the reason for this compromise is to keep using xadd for readers.
Your scheme will certainly work, but we would have to use cmpxchg for
readers too. That will have a performance impact, especially with
multiple readers contending, which is what I am trying to avoid.

Cheers,
Longman
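
For illustration (not from the thread): the trade-off above boils down to
the reader fast path being a single unconditional fetch-add versus a
compare-and-swap retry loop. A minimal C11 userspace sketch, using
illustrative names rather than the kernel's atomic_long_* API:

  #include <stdatomic.h>
  #include <stdbool.h>

  #define READER_BIAS (1UL << 8) /* illustrative reader increment  */
  #define WRITER_BIT  (1UL << 0) /* illustrative writer-locked bit */

  /* xadd-style fast path: one unconditional atomic add that never
   * retries, even when many readers hit the same cacheline at once. */
  static unsigned long reader_lock_xadd(atomic_ulong *count)
  {
          return atomic_fetch_add_explicit(count, READER_BIAS,
                                           memory_order_acquire);
  }

  /* cmpxchg-style fast path: must reload and retry whenever any other
   * reader or writer changed the word first - exactly the contended case. */
  static bool reader_lock_cmpxchg(atomic_ulong *count)
  {
          unsigned long old = atomic_load_explicit(count, memory_order_relaxed);

          do {
                  if (old & WRITER_BIT)
                          return false; /* write-locked: take the slow path */
          } while (!atomic_compare_exchange_weak_explicit(count, &old,
                          old + READER_BIAS,
                          memory_order_acquire, memory_order_relaxed));
          return true;
  }

  int main(void)
  {
          atomic_ulong count = 0;

          reader_lock_xadd(&count);                   /* first reader via xadd */
          return reader_lock_cmpxchg(&count) ? 0 : 1; /* second via cmpxchg    */
  }
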
Peter Zijlstra Feb. 7, 2019, 8:08 p.m. UTC | #3
On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
> On 32-bit architectures, there aren't enough bits to hold both.
> 64-bit architectures, however, can have enough bits to do that. For
> x86-64, the physical address can use up to 52 bits. That is 4PB of
> memory. That leaves 12 bits available for other use. The task structure
> pointer is also aligned to the L1 cache size. That means another 6 bits
> (64 bytes cacheline) will be available. Reserving 2 bits for status
> flags, we will have 16 bits for the reader count.  That can supports
> up to (64k-1) readers.

64k readers sounds like a number that is fairly 'easy' to reach, esp. on
64bit. These are preemptible locks after all, all we need to do is get
64k tasks nested on enough CPUs.

I'm sure there's some willing Java proglet around that spawns more than
64k threads just because it can. Run it on a big enough machine (ISTR
there's a number of >1k CPU systems out there) and voila.
Waiman Long Feb. 7, 2019, 8:54 p.m. UTC | #4
On 02/07/2019 03:08 PM, Peter Zijlstra wrote:
> On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
>> On 32-bit architectures, there aren't enough bits to hold both.
>> 64-bit architectures, however, can have enough bits to do that. For
>> x86-64, the physical address can use up to 52 bits. That is 4PB of
>> memory. That leaves 12 bits available for other use. The task structure
>> pointer is also aligned to the L1 cache size. That means another 6 bits
>> (64 bytes cacheline) will be available. Reserving 2 bits for status
>> flags, we will have 16 bits for the reader count.  That can supports
>> up to (64k-1) readers.
> 64k readers sounds like a number that is fairly 'easy' to reach, esp. on
> 64bit. These are preemptible locks after all, all we need to do is get
> 64k tasks nested on enough CPUs.
>
> I'm sure there's some willing Java proglet around that spawns more than
> 64k threads just because it can. Run it on a big enough machine (ISTR
> there's a number of >1k CPU systems out there) and voila.

Yes, that can be a problem.

One possible solution is to check if the count goes negative. If so,
fail the read lock and make the readers wait in the wait queue until the
count is in positive territory. That effectively reduces the reader
count to 15 bits, but it will avoid the overflow situation. I will try
to add that support into the next version.

Cheers,
Longman
Waiman Long Feb. 8, 2019, 2:19 p.m. UTC | #5
On 02/07/2019 03:54 PM, Waiman Long wrote:
> On 02/07/2019 03:08 PM, Peter Zijlstra wrote:
>> On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
>>> On 32-bit architectures, there aren't enough bits to hold both.
>>> 64-bit architectures, however, can have enough bits to do that. For
>>> x86-64, the physical address can use up to 52 bits. That is 4PB of
>>> memory. That leaves 12 bits available for other use. The task structure
>>> pointer is also aligned to the L1 cache size. That means another 6 bits
>>> (64 bytes cacheline) will be available. Reserving 2 bits for status
>>> flags, we will have 16 bits for the reader count.  That can supports
>>> up to (64k-1) readers.
>> 64k readers sounds like a number that is fairly 'easy' to reach, esp. on
>> 64bit. These are preemptible locks after all, all we need to do is get
>> 64k tasks nested on enough CPUs.
>>
>> I'm sure there's some willing Java proglet around that spawns more than
>> 64k threads just because it can. Run it on a big enough machine (ISTR
>> there's a number of >1k CPU systems out there) and voila.
> Yes, that can be a problem.
>
> One possible solution is to check if the count goes negative. If so,
> fail the read lock and make the readers wait in the wait queue until the
> count is in positive territory. That effectively reduces the reader
> count to 15 bits, but it will avoid the overflow situation. I will try
> to add that support into the next version.
>
> Cheers,
> Longman

Something like the attached patch.

Cheers,
Longman
From 746913e7d14e874eeace1e146e63bdaea4dfd4a5 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Fri, 8 Feb 2019 08:58:10 -0500
Subject: [PATCH 23/23] locking/rwsem: Make MSbit of count as guard bit to fail
 readlock

With the merging of owner into count for x86-64, there are only 16 bits
left for the reader count. It is theoretically possible for an
application to cause more than 64k readers to acquire a rwsem, leading
to count overflow.

To prevent this dire situation, the most significant bit of the count is
now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock will now
fail in both the fast and optimistic spinning paths whenever this bit is
set, so all those extra readers will be put to sleep in the wait queue.
Wakeup will not happen until the reader count reaches 0.

A limit of 256 is also imposed on the number of readers that can be woken
up in one wakeup function call. This will eliminate the possibility of
waking up more than 64k readers and overflowing the count.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |  1 +
 kernel/locking/rwsem-xadd.c       | 40 ++++++++++++++++++++++++++++++++------
 kernel/locking/rwsem-xadd.h       | 41 ++++++++++++++++++++++++++-------------
 3 files changed, 62 insertions(+), 20 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 0052534..9ecdeac 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -60,6 +60,7 @@
 LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
+LOCK_EVENT(rwsem_opt_rfail)	/* # of failed reader-owned readlocks	*/
 LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled reader opt-spinnings	*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 213c2aa..a993055 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -110,6 +110,8 @@ enum rwsem_wake_type {
 # define RWSEM_RSPIN_MAX	(1 << 12)
 #endif
 
+#define MAX_READERS_WAKEUP	0x100
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -208,6 +210,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		 * after setting the reader waiter to nil.
 		 */
 		wake_q_add_safe(wake_q, tsk);
+
+		/*
+		 * Limit # of readers that can be woken up per wakeup call.
+		 */
+		if (woken >= MAX_READERS_WAKEUP)
+			break;
 	}
 
 	adjustment = woken * RWSEM_READER_BIAS - adjustment;
@@ -445,6 +453,16 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
 			break;
 
 		/*
+		 * If a reader cannot acquire a reader-owned lock, we
+		 * have to quit. It is either the handoff bit just got
+		 * set or (unlikely) readfail bit is somehow set.
+		 */
+		if (unlikely(!wlock && (owner_state == OWNER_READER))) {
+			lockevent_inc(rwsem_opt_rfail);
+			break;
+		}
+
+		/*
 		 * An RT task cannot do optimistic spinning if it cannot
 		 * be sure the lock holder is running. When there's no owner
 		 * or is reader-owned, an RT task has to stop spinning or
@@ -526,12 +544,22 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem,
  * Wait for the read lock to be granted
  */
 static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state, long count)
 {
-	long count, adjustment = -RWSEM_READER_BIAS;
+	long adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	if (unlikely(count < 0)) {
+		/*
+		 * Too many active readers, decrement count &
+		 * enter the wait queue.
+		 */
+		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+		adjustment = 0;
+		goto queue;
+	}
+
 	if (!rwsem_can_spin_on_owner(sem))
 		goto queue;
 
@@ -635,16 +663,16 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem,
 }
 
 __visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
+rwsem_down_read_failed(struct rw_semaphore *sem, long cnt)
 {
-	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
+	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE, cnt);
 }
 EXPORT_SYMBOL(rwsem_down_read_failed);
 
 __visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
+rwsem_down_read_failed_killable(struct rw_semaphore *sem, long cnt)
 {
-	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
+	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE, cnt);
 }
 EXPORT_SYMBOL(rwsem_down_read_failed_killable);
 
diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index be67dbd..72308b7 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -63,7 +63,8 @@
  * Bit   0    - waiters present bit
  * Bit   1    - lock handoff bit
  * Bits  2-47 - compressed task structure pointer
- * Bits 48-63 - 16-bit reader counts
+ * Bits 48-62 - 15-bit reader counts
+ * Bit  63    - read fail bit
  *
  * On other 64-bit architectures, the bit definitions are:
  *
@@ -71,7 +72,8 @@
  * Bit  1    - lock handoff bit
  * Bits 2-6  - reserved
  * Bit  7    - writer lock bit
- * Bits 8-63 - 56-bit reader counts
+ * Bits 8-62 - 55-bit reader counts
+ * Bit  63   - read fail bit
  *
  * On 32-bit architectures, the bit definitions of the count are:
  *
@@ -79,13 +81,15 @@
  * Bit  1    - lock handoff bit
  * Bits 2-6  - reserved
  * Bit  7    - writer lock bit
- * Bits 8-31 - 24-bit reader counts
+ * Bits 8-30 - 23-bit reader counts
+ * Bit  32   - read fail bit
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
  */
 #define RWSEM_FLAG_WAITERS	(1UL << 0)
 #define RWSEM_FLAG_HANDOFF	(1UL << 1)
+#define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
 #ifdef CONFIG_X86_64
 
@@ -108,7 +112,7 @@
 #define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
-				 RWSEM_FLAG_HANDOFF)
+				 RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
 
 #define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
 #define RWSEM_COUNT_WLOCKED(c)	((c) & RWSEM_WRITER_MASK)
@@ -302,10 +306,15 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 }
 #endif
 
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_read_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed_killable(struct rw_semaphore *sem);
+extern struct rw_semaphore *
+rwsem_down_read_failed(struct rw_semaphore *sem, long count);
+extern struct rw_semaphore *
+rwsem_down_read_failed_killable(struct rw_semaphore *sem, long count);
+extern struct rw_semaphore *
+rwsem_down_write_failed(struct rw_semaphore *sem);
+extern struct rw_semaphore *
+rwsem_down_write_failed_killable(struct rw_semaphore *sem);
+
 extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count);
 extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
 
@@ -314,9 +323,11 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
  */
 static inline void __down_read(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_failed(sem);
+	long count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+						   &sem->count);
+
+	if (unlikely(count & RWSEM_READ_FAILED_MASK)) {
+		rwsem_down_read_failed(sem, count);
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
@@ -325,9 +336,11 @@ static inline void __down_read(struct rw_semaphore *sem)
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+	long count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+						   &sem->count);
+
+	if (unlikely(count & RWSEM_READ_FAILED_MASK)) {
+		if (IS_ERR(rwsem_down_read_failed_killable(sem, count)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {

Patch

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 719d390..0869fbf 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -27,11 +27,11 @@ 
 /*
  * Guide to the rw_semaphore's count field.
  *
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
+ * When any of the RWSEM_WRITER_MASK bits in count is set, the lock is
+ * owned by a writer.
  *
  * The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (1) none of the RWSEM_WRITER_MASK bits is set in count,
  * (2) some of the reader bits are set in count, and
  * (3) the owner field has RWSEM_READ_OWNED bit set.
  *
@@ -47,6 +47,11 @@ 
 void __init_rwsem(struct rw_semaphore *sem, const char *name,
 		  struct lock_class_key *key)
 {
+	/*
+	 * We should support at least (4k-1) concurrent readers
+	 */
+	BUILD_BUG_ON(sizeof(long) * 8 - RWSEM_READER_SHIFT < 12);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	/*
 	 * Make sure we are not reinitializing a held semaphore:
@@ -297,7 +302,14 @@  static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		return false;
 
 	rcu_read_lock();
-	while (owner && (rwsem_get_owner(sem) == owner)) {
+	/*
+	 * In case the owner task pointer is also stored in the count,
+	 * checking the sem->owner value alone will give an early indication
+	 * if the owner is about to release the lock (sem->owner cleared).
+	 * This enables the spinner to move forward and do a trylock
+	 * earlier.
+	 */
+	while (owner && (READ_ONCE(sem->owner) == owner)) {
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
 		 * checking sem->owner still matches owner, if that fails,
diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index 277a134..d54b5db 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -37,25 +37,73 @@ 
 #endif
 
 /*
- * The definition of the atomic counter in the semaphore:
+ * With separate count and owner, there are timing windows where the two
+ * values are inconsistent. That can cause problem when trying to figure
+ * out the exact state of the rwsem. That can be solved by combining
+ * the count and owner together in a single atomic value.
  *
- * Bit  0   - writer locked bit
- * Bit  1   - waiters present bit
- * Bit  2   - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * On 64-bit architectures, the owner task structure pointer can be
+ * compressed and combined with reader count and other status flags.
+ * A simple compression method is to map the virtual address back to
+ * the physical address by subtracting PAGE_OFFSET. On 32-bit
+ * architectures, the long integer value just isn't big enough for
+ * combining owner and count. So they remain separate.
+ *
+ * For x86-64, the physical address can use up to 52 bits. That is 4PB
+ * of memory. That leaves 12 bits available for other use. The task
+ * structure pointer is also aligned to the L1 cache size. That means
+ * another 6 bits (64 bytes cacheline) will be available. Reserving
+ * 2 bits for status flags, we will have 16 bits for the reader count.
+ * That can supports up to (64k-1) readers.
+ *
+ * On x86-64, the bit definitions of the count are:
+ *
+ * Bit   0    - waiters present bit
+ * Bit   1    - lock handoff bit
+ * Bits  2-47 - compressed task structure pointer
+ * Bits 48-63 - 16-bit reader counts
+ *
+ * On other 64-bit architectures, the bit definitions are:
+ *
+ * Bit  0    - waiters present bit
+ * Bit  1    - lock handoff bit
+ * Bits 2-6  - reserved
+ * Bit  7    - writer lock bit
+ * Bits 8-63 - 56-bit reader counts
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit  0    - waiters present bit
+ * Bit  1    - lock handoff bit
+ * Bits 2-6  - reserved
+ * Bit  7    - writer lock bit
+ * Bits 8-31 - 24-bit reader counts
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
  */
-#define RWSEM_WRITER_LOCKED	(1UL << 0)
-#define RWSEM_FLAG_WAITERS	(1UL << 1)
-#define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_WAITERS	(1UL << 0)
+#define RWSEM_FLAG_HANDOFF	(1UL << 1)
 
+#ifdef CONFIG_X86_64
+
+#ifdef __PHYSICAL_MASK_SHIFT
+#define RWSEM_PA_MASK_SHIFT	__PHYSICAL_MASK_SHIFT
+#else
+#define RWSEM_PA_MASK_SHIFT	52
+#endif
+#define RWSEM_READER_SHIFT	(RWSEM_PA_MASK_SHIFT - L1_CACHE_SHIFT + 2)
+#define RWSEM_WRITER_MASK	((1UL << RWSEM_READER_SHIFT) - 4)
+#define RWSEM_WRITER_LOCKED	rwsem_owner_count(current)
+
+#else /* CONFIG_X86_64 */
+#define RWSEM_WRITER_MASK	(1UL << 7)
 #define RWSEM_READER_SHIFT	8
+#define RWSEM_WRITER_LOCKED	RWSEM_WRITER_MASK
+#endif /* CONFIG_X86_64 */
+
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
 #define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
 				 RWSEM_FLAG_HANDOFF)
@@ -65,6 +113,21 @@ 
 #define RWSEM_COUNT_LOCKED_OR_HANDOFF(c)	\
 	((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))
 
+/*
+ * Task structure pointer compression (64-bit only):
+ * (owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2)
+ */
+static inline unsigned long rwsem_owner_count(struct task_struct *owner)
+{
+	return ((unsigned long)owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2);
+}
+
+static inline unsigned long rwsem_count_owner(long count)
+{
+	return (((unsigned long)count & RWSEM_WRITER_MASK)
+			<< (L1_CACHE_SHIFT - 2)) + PAGE_OFFSET;
+}
+
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -72,7 +135,12 @@ 
  * the owner value concurrently without lock. Read from owner, however,
  * may not need READ_ONCE() as long as the pointer value is only used
  * for comparison and isn't being dereferenced.
+ *
+ * On 32-bit architectures, the owner and count are separate. On 64-bit
+ * architectures, however, the writer task structure pointer is written
+ * to the count as well in addition to the owner field.
  */
+
 static inline void rwsem_set_owner(struct rw_semaphore *sem)
 {
 	WRITE_ONCE(sem->owner, current);
@@ -83,10 +151,22 @@  static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 	WRITE_ONCE(sem->owner, NULL);
 }
 
+#ifdef CONFIG_X86_64
+/*
+ * Get the owner value from count to have early access to the task structure.
+ */
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+	return (struct task_struct *)
+		(rwsem_count_owner(atomic_long_read(&sem->count)) |
+		((unsigned long)READ_ONCE(sem->owner) & 3));
+}
+#else /* !CONFIG_X86_64 */
 static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
 {
 	return READ_ONCE(sem->owner);
 }
+#endif /* CONFIG_X86_64 */
 
 /*
  * The task_struct pointer of the last owning reader will be left in
@@ -291,8 +371,11 @@  static inline void __up_write(struct rw_semaphore *sem)
 	long tmp;
 
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+#ifdef CONFIG_X86_64
+	DEBUG_RWSEMS_WARN_ON(sem->owner != rwsem_get_owner(sem), sem);
+#endif
 	rwsem_clear_owner(sem);
-	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+	tmp = atomic_long_fetch_and_release(~RWSEM_WRITER_MASK, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem, tmp);
 }