[1/3] lockdep: Apply crossrelease to PG_locked locks

Message ID 1510802067-18609-2-git-send-email-byungchul.park@lge.com (mailing list archive)
State New, archived

Commit Message

Byungchul Park Nov. 16, 2017, 3:14 a.m. UTC
Although lock_page() and its family can cause deadlock, lockdep has
not worked with them, because unlock_page() might be called in a
different context from the acquire context, which violates lockdep's
assumptions.

Now that CONFIG_LOCKDEP_CROSSRELEASE has been introduced, lockdep can
work with page locks.
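
To make the constraint concrete, here is a typical pattern
(illustration only, not part of this patch): the page lock is taken in
process context, and released from the IO completion path once the
read finishes:

   /* process context: acquire PG_locked, then start the read */
   lock_page(page);
   submit_bio(bio);                /* IO in flight; do not unlock here */

   /* IO completion context: a different task releases the lock */
   static void read_endio(struct bio *bio)
   {
           struct page *page = bio->bi_io_vec[0].bv_page; /* illustrative */

           unlock_page(page);      /* release context != acquire context */
           bio_put(bio);
   }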

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
---
 include/linux/mm_types.h |   8 ++++
 include/linux/pagemap.h  | 101 ++++++++++++++++++++++++++++++++++++++++++++---
 lib/Kconfig.debug        |   7 ++++
 mm/filemap.c             |   4 +-
 mm/page_alloc.c          |   3 ++
 5 files changed, 115 insertions(+), 8 deletions(-)

Comments

Michal Hocko Nov. 16, 2017, 12:02 p.m. UTC | #1
[I have only briefly looked at the patches, so I might have missed
some details.]

On Thu 16-11-17 12:14:25, Byungchul Park wrote:
> Although lock_page() and its family can cause deadlock, lockdep has
> not worked with them, because unlock_page() might be called in a
> different context from the acquire context, which violates lockdep's
> assumptions.
>
> Now that CONFIG_LOCKDEP_CROSSRELEASE has been introduced, lockdep can
> work with page locks.

I definitely agree that debugging page_lock deadlocks is a major PITA,
but your implementation seems prohibitively expensive.

[...]
> @@ -218,6 +222,10 @@ struct page {
>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>  	int _last_cpupid;
>  #endif
> +
> +#ifdef CONFIG_LOCKDEP_PAGELOCK
> +	struct lockdep_map_cross map;
> +#endif
>  }

now you are adding 
struct lockdep_map_cross {
        struct lockdep_map         map;                  /*     0    40 */
        struct cross_lock          xlock;                /*    40    56 */
        /* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */

        /* size: 96, cachelines: 2, members: 2 */
        /* last cacheline: 32 bytes */
};

for each struct page. So you are doubling the size. Who is going to
enable this config option? You are moving this to page_ext in a later
patch, which is a good step, but it doesn't go far enough because this
still consumes those resources. Is there any problem with making this
controllable from the kernel command line? Something like what we do
for page_owner, for example?
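
For reference, the page_owner-style plumbing I have in mind would look
roughly like this (just a sketch; the parameter name and the hooks are
illustrative, not something in this patch):

   /* Gate the per-page tracking behind a boot parameter. */
   DEFINE_STATIC_KEY_FALSE(page_lock_tracking_key);
   static bool page_lock_tracking_requested __initdata;

   static int __init early_page_lock_tracking(char *buf)
   {
           return kstrtobool(buf, &page_lock_tracking_requested);
   }
   early_param("page_lock_tracking", early_page_lock_tracking);

   static void __init init_page_lock_tracking(void)
   {
           if (page_lock_tracking_requested)
                   static_branch_enable(&page_lock_tracking_key);
   }

   /*
    * lock_page_acquire() and lock_page_release() would then bail out
    * early with:
    *
    *      if (!static_branch_unlikely(&page_lock_tracking_key))
    *              return;
    */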

Also it would be really great if you could give us some measurements of
the runtime overhead. I do not expect it to be very large, but this is
something people are usually interested in when enabling debugging
features.
Byungchul Park Nov. 16, 2017, 12:48 p.m. UTC | #2
On 11/16/2017 9:02 PM, Michal Hocko wrote:
> for each struct page. So you are doubling the size. Who is going to
> enable this config option? You are moving this to page_ext in a later
> patch, which is a good step, but it doesn't go far enough because this
> still consumes those resources. Is there any problem with making this
> controllable from the kernel command line? Something like what we do
> for page_owner, for example?

Sure. I will add it.

> Also it would be really great if you could give us some measurements of
> the runtime overhead. I do not expect it to be very large, but this is

The major overhead would come from the additional memory consumed by
the 'lockdep_map's.

Do you want me to measure the overhead of the additional memory
consumption?

Or do you expect overhead somewhere else?

Could you tell me what kind of result you want to see?

> something people are usually interested in when enabling debugging
> features.
Michal Hocko Nov. 16, 2017, 1:07 p.m. UTC | #3
On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > for each struct page. So you are doubling the size. Who is going to
> > enable this config option? You are moving this to page_ext in a later
> > patch, which is a good step, but it doesn't go far enough because this
> > still consumes those resources. Is there any problem with making this
> > controllable from the kernel command line? Something like what we do
> > for page_owner, for example?
> 
> Sure. I will add it.
> 
> > Also it would be really great if you could give us some measurements of
> > the runtime overhead. I do not expect it to be very large, but this is
> 
> The major overhead would come from the additional memory consumed by
> the 'lockdep_map's.

yes

> Do you want me to measure the overhead of the additional memory
> consumption?
> 
> Or do you expect overhead somewhere else?

I would also be interested in how much impact this has on performance.
I do not expect it would be too large, but it would be nice to have some
numbers for a cache-cold parallel kbuild or other page-lock-heavy
workloads.
Byungchul Park Nov. 24, 2017, 3:02 a.m. UTC | #4
On Thu, Nov 16, 2017 at 02:07:46PM +0100, Michal Hocko wrote:
> On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> > On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > > for each struct page. So you are doubling the size. Who is going to
> > > enable this config option? You are moving this to page_ext in a later
> > > patch, which is a good step, but it doesn't go far enough because this
> > > still consumes those resources. Is there any problem with making this
> > > controllable from the kernel command line? Something like what we do
> > > for page_owner, for example?
> > 
> > Sure. I will add it.
> > 
> > > Also it would be really great if you could give us some measurements of
> > > the runtime overhead. I do not expect it to be very large, but this is
> > 
> > The major overhead would come from the additional memory consumed by
> > the 'lockdep_map's.
> 
> yes
> 
> > Do you want me to measure the overhead of the additional memory
> > consumption?
> > 
> > Or do you expect overhead somewhere else?
> 
> I would also be interested in how much impact this has on performance.
> I do not expect it would be too large, but it would be nice to have some
> numbers for a cache-cold parallel kbuild or other page-lock-heavy
> workloads.

Hello Michal,

I measured a 'cache cold parallel kbuild' on my qemu machine. The
results vary a lot, so I cannot be certain, but I think there's no
meaningful difference between before and after applying crossrelease to
page locks.

Actually, I expect little overhead in lock_page() and unlock_page()
even after applying crossrelease to page locks; I only expect a bit of
overhead from the additional memory consumption for the 'lockdep_map's
per page.
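
For scale (my own rough arithmetic, based on the 96-byte
'lockdep_map_cross' layout you showed earlier): a 4GB machine with 4KB
pages has about one million struct pages, so the per-page maps cost on
the order of 1M * 96B = ~96MB, roughly 2.3% of RAM.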

I ran the following commands within "QEMU x86_64 4GB memory 4 cpus":

   make clean
   echo 3 > /proc/sys/vm/drop_caches
   time make -j4

The results are:

   # w/o page lock tracking

   At the 1st try,
   real     5m28.105s
   user     17m52.716s
   sys      3m8.871s

   At the 2nd try,
   real     5m27.023s
   user     17m50.134s
   sys      3m9.289s

   At the 3rd try,
   real     5m22.837s
   user     17m34.514s
   sys      3m8.097s

   # w/ page lock tracking

   At the 1st try,
   real     5m18.158s
   user     17m18.200s
   sys      3m8.639s

   At the 2nd try,
   real     5m19.329s
   user     17m19.982s
   sys      3m8.345s

   At the 3rd try,
   real     5m19.626s
   user     17m21.363s
   sys      3m9.869s

I think there's no meaningful difference on my small machine.

--
Thanks,
Byungchul
Michal Hocko Nov. 24, 2017, 8:11 a.m. UTC | #5
On Fri 24-11-17 12:02:36, Byungchul Park wrote:
> On Thu, Nov 16, 2017 at 02:07:46PM +0100, Michal Hocko wrote:
> > On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> > > On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > > > for each struct page. So you are doubling the size. Who is going to
> > > > enable this config option? You are moving this to page_ext in a later
> > > > patch, which is a good step, but it doesn't go far enough because this
> > > > still consumes those resources. Is there any problem with making this
> > > > controllable from the kernel command line? Something like what we do
> > > > for page_owner, for example?
> > > 
> > > Sure. I will add it.
> > > 
> > > > Also it would be really great if you could give us some measurements of
> > > > the runtime overhead. I do not expect it to be very large, but this is
> > > 
> > > The major overhead would come from the additional memory consumed by
> > > the 'lockdep_map's.
> > 
> > yes
> > 
> > > Do you want me to measure the overhead of the additional memory
> > > consumption?
> > > 
> > > Or do you expect overhead somewhere else?
> > 
> > I would also be interested in how much impact this has on performance.
> > I do not expect it would be too large, but it would be nice to have some
> > numbers for a cache-cold parallel kbuild or other page-lock-heavy
> > workloads.
> 
> Hello Michal,
> 
> I measured a 'cache cold parallel kbuild' on my qemu machine. The
> results vary a lot, so I cannot be certain, but I think there's no
> meaningful difference between before and after applying crossrelease to
> page locks.
> 
> Actually, I expect little overhead in lock_page() and unlock_page()
> even after applying crossrelease to page locks; I only expect a bit of
> overhead from the additional memory consumption for the 'lockdep_map's
> per page.
> 
> I ran the following commands within "QEMU x86_64 4GB memory 4 cpus":
> 
>    make clean
>    echo 3 > /proc/sys/vm/drop_caches
>    time make -j4

Maybe FS people will help you find a more representative workload. E.g.
a linear cache-cold file read should be good as well. Maybe there are
some tests in fstests (or whatever they call xfstests these days).

> The results are:
> 
>    # w/o page lock tracking
> 
>    At the 1st try,
>    real     5m28.105s
>    user     17m52.716s
>    sys      3m8.871s
> 
>    At the 2nd try,
>    real     5m27.023s
>    user     17m50.134s
>    sys      3m9.289s
> 
>    At the 3rd try,
>    real     5m22.837s
>    user     17m34.514s
>    sys      3m8.097s
> 
>    # w/ page lock tracking
> 
>    At the 1st try,
>    real     5m18.158s
>    user     17m18.200s
>    sys      3m8.639s
> 
>    At the 2nd try,
>    real     5m19.329s
>    user     17m19.982s
>    sys      3m8.345s
> 
>    At the 3rd try,
>    real     5m19.626s
>    user     17m21.363s
>    sys      3m9.869s
> 
> I think there's no meaningful difference on my small machine.

Yeah, this doesn't seem to indicate anything. Maybe moving the build to
shmem to rule out the IO cost would tell more. But as I've said
previously, I do not really expect this to be very visible. It was more
a matter of my curiosity than an acceptance requirement. I think it is
much more important to make this runtime configurable, because almost
nobody is going to enable the feature if it is only build time. The
cost is just too high.
Jan Kara Nov. 24, 2017, 9:38 a.m. UTC | #6
On Fri 24-11-17 09:11:49, Michal Hocko wrote:
> On Fri 24-11-17 12:02:36, Byungchul Park wrote:
> > On Thu, Nov 16, 2017 at 02:07:46PM +0100, Michal Hocko wrote:
> > > On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> > > > On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > > > > for each struct page. So you are doubling the size. Who is going to
> > > > > enable this config option? You are moving this to page_ext in a later
> > > > > patch, which is a good step, but it doesn't go far enough because this
> > > > > still consumes those resources. Is there any problem with making this
> > > > > controllable from the kernel command line? Something like what we do
> > > > > for page_owner, for example?
> > > > 
> > > > Sure. I will add it.
> > > > 
> > > > > Also it would be really great if you could give us some measurements of
> > > > > the runtime overhead. I do not expect it to be very large, but this is
> > > > 
> > > > The major overhead would come from the additional memory consumed by
> > > > the 'lockdep_map's.
> > > 
> > > yes
> > > 
> > > > Do you want me to measure the overhead of the additional memory
> > > > consumption?
> > > > 
> > > > Or do you expect overhead somewhere else?
> > > 
> > > I would also be interested in how much impact this has on performance.
> > > I do not expect it would be too large, but it would be nice to have some
> > > numbers for a cache-cold parallel kbuild or other page-lock-heavy
> > > workloads.
> > 
> > Hello Michal,
> > 
> > I measured a 'cache cold parallel kbuild' on my qemu machine. The
> > results vary a lot, so I cannot be certain, but I think there's no
> > meaningful difference between before and after applying crossrelease to
> > page locks.
> > 
> > Actually, I expect little overhead in lock_page() and unlock_page()
> > even after applying crossrelease to page locks; I only expect a bit of
> > overhead from the additional memory consumption for the 'lockdep_map's
> > per page.
> > 
> > I ran the following commands within "QEMU x86_64 4GB memory 4 cpus":
> > 
> >    make clean
> >    echo 3 > /proc/sys/vm/drop_caches
> >    time make -j4
> 
> Maybe FS people will help you find a more representative workload. E.g.
> a linear cache-cold file read should be good as well. Maybe there are
> some tests in fstests (or whatever they call xfstests these days).

So a relatively good test of page handling costs is to mmap a cache-hot
file and measure the time to fault in all the pages in the mapping. That
way IO and the filesystem stay out of the way and you measure only the
page table lookup, the page handling (taking a page ref and locking the
page), and the instantiation of the new PTE. Of these, the page handling
is actually the significant part.
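
A minimal userspace sketch of such a fault-in benchmark could look like
this (my illustration, not from the thread; it assumes a pre-existing,
cache-hot file named "testfile" in the current directory):

   #include <fcntl.h>
   #include <stdio.h>
   #include <sys/mman.h>
   #include <sys/stat.h>
   #include <time.h>
   #include <unistd.h>

   int main(void)
   {
           struct timespec t0, t1;
           volatile char sum = 0;
           struct stat st;
           char *p;
           size_t i;
           int fd = open("testfile", O_RDONLY);

           if (fd < 0 || fstat(fd, &st) < 0)
                   return 1;
           p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
           if (p == MAP_FAILED)
                   return 1;

           clock_gettime(CLOCK_MONOTONIC, &t0);
           for (i = 0; i < (size_t)st.st_size; i += 4096)
                   sum += p[i];    /* touch each page: fault + lock + ref */
           clock_gettime(CLOCK_MONOTONIC, &t1);

           printf("faulted %zu pages in %.3f ms\n",
                  (size_t)st.st_size / 4096,
                  (t1.tv_sec - t0.tv_sec) * 1e3 +
                  (t1.tv_nsec - t0.tv_nsec) / 1e6);

           munmap(p, st.st_size);
           close(fd);
           return 0;
   }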

								Honza

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c85f11d..263b861 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -17,6 +17,10 @@ 
 
 #include <asm/mmu.h>
 
+#ifdef CONFIG_LOCKDEP_PAGELOCK
+#include <linux/lockdep.h>
+#endif
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -218,6 +222,10 @@  struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	int _last_cpupid;
 #endif
+
+#ifdef CONFIG_LOCKDEP_PAGELOCK
+	struct lockdep_map_cross map;
+#endif
 }
 /*
  * The struct page can be forced to be double word aligned so that atomic ops
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e08b533..35b4f67 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -15,6 +15,9 @@ 
 #include <linux/bitops.h>
 #include <linux/hardirq.h> /* for in_interrupt() */
 #include <linux/hugetlb_inline.h>
+#ifdef CONFIG_LOCKDEP_PAGELOCK
+#include <linux/lockdep.h>
+#endif
 
 /*
  * Bits in mapping->flags.
@@ -457,26 +460,91 @@  static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
 	return pgoff;
 }
 
+#ifdef CONFIG_LOCKDEP_PAGELOCK
+#define lock_page_init(p)						\
+do {									\
+	static struct lock_class_key __key;				\
+	lockdep_init_map_crosslock((struct lockdep_map *)&(p)->map,	\
+			"(PG_locked)" #p, &__key, 0);			\
+} while (0)
+
+static inline void lock_page_acquire(struct page *page, int try)
+{
+	page = compound_head(page);
+	lock_acquire_exclusive((struct lockdep_map *)&page->map, 0,
+			       try, NULL, _RET_IP_);
+}
+
+static inline void lock_page_release(struct page *page)
+{
+	page = compound_head(page);
+	/*
+	 * lock_commit_crosslock() is necessary for crosslocks.
+	 */
+	lock_commit_crosslock((struct lockdep_map *)&page->map);
+	lock_release((struct lockdep_map *)&page->map, 0, _RET_IP_);
+}
+#else
+static inline void lock_page_init(struct page *page) {}
+static inline void lock_page_free(struct page *page) {}
+static inline void lock_page_acquire(struct page *page, int try) {}
+static inline void lock_page_release(struct page *page) {}
+#endif
+
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 				unsigned int flags);
-extern void unlock_page(struct page *page);
+extern void do_raw_unlock_page(struct page *page);
 
-static inline int trylock_page(struct page *page)
+static inline void unlock_page(struct page *page)
+{
+	lock_page_release(page);
+	do_raw_unlock_page(page);
+}
+
+static inline int do_raw_trylock_page(struct page *page)
 {
 	page = compound_head(page);
 	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
 }
 
+static inline int trylock_page(struct page *page)
+{
+	if (do_raw_trylock_page(page)) {
+		lock_page_acquire(page, 1);
+		return 1;
+	}
+	return 0;
+}
+
 /*
  * lock_page may only be called if we have the page's inode pinned.
  */
 static inline void lock_page(struct page *page)
 {
 	might_sleep();
-	if (!trylock_page(page))
+
+	if (!do_raw_trylock_page(page))
 		__lock_page(page);
+	/*
+	 * acquire() must be after actual lock operation for crosslocks.
+	 * This way a crosslock and current lock can be ordered like:
+	 *
+	 *	CONTEXT 1		CONTEXT 2
+	 *	---------		---------
+	 *	lock A (cross)
+	 *	acquire A
+	 *	  X = atomic_inc_return(&cross_gen_id)
+	 *	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+	 *				acquire B
+	 *				  Y = atomic_read_acquire(&cross_gen_id)
+	 *				lock B
+	 *
+	 * so that 'lock A and then lock B' can be seen globally,
+	 * if X <= Y.
+	 */
+	lock_page_acquire(page, 0);
 }
 
 /*
@@ -486,9 +554,20 @@  static inline void lock_page(struct page *page)
  */
 static inline int lock_page_killable(struct page *page)
 {
+	int ret;
+
 	might_sleep();
-	if (!trylock_page(page))
-		return __lock_page_killable(page);
+
+	if (!do_raw_trylock_page(page)) {
+		ret = __lock_page_killable(page);
+		if (ret)
+			return ret;
+	}
+	/*
+	 * acquire() must be after actual lock operation for crosslocks.
+	 * This way a crosslock and other locks can be ordered.
+	 */
+	lock_page_acquire(page, 0);
 	return 0;
 }
 
@@ -503,7 +582,17 @@  static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
 				     unsigned int flags)
 {
 	might_sleep();
-	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
+
+	if (do_raw_trylock_page(page) || __lock_page_or_retry(page, mm, flags)) {
+		/*
+		 * acquire() must be after actual lock operation for crosslocks.
+		 * This way a crosslock and other locks can be ordered.
+		 */
+		lock_page_acquire(page, 0);
+		return 1;
+	}
+
+	return 0;
 }
 
 /*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2b439a5..2e8c679 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1094,6 +1094,7 @@  config PROVE_LOCKING
 	select DEBUG_LOCK_ALLOC
 	select LOCKDEP_CROSSRELEASE
 	select LOCKDEP_COMPLETIONS
+	select LOCKDEP_PAGELOCK
 	select TRACE_IRQFLAGS
 	default n
 	help
@@ -1179,6 +1180,12 @@  config LOCKDEP_COMPLETIONS
 	 A deadlock caused by wait_for_completion() and complete() can be
 	 detected by lockdep using crossrelease feature.
 
+config LOCKDEP_PAGELOCK
+	bool
+	help
+	 PG_locked lock is a kind of crosslock. Using crossrelease feature,
+	 PG_locked lock can work with lockdep.
+
 config BOOTPARAM_LOCKDEP_CROSSRELEASE_FULLSTACK
 	bool "Enable the boot parameter, crossrelease_fullstack"
 	depends on LOCKDEP_CROSSRELEASE
diff --git a/mm/filemap.c b/mm/filemap.c
index 594d73f..870d442 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1099,7 +1099,7 @@  static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
  * portably (architectures that do LL/SC can test any bit, while x86 can
  * test the sign bit).
  */
-void unlock_page(struct page *page)
+void do_raw_unlock_page(struct page *page)
 {
 	BUILD_BUG_ON(PG_waiters != 7);
 	page = compound_head(page);
@@ -1107,7 +1107,7 @@  void unlock_page(struct page *page)
 	if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
 		wake_up_page_bit(page, PG_locked);
 }
-EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(do_raw_unlock_page);
 
 /**
  * end_page_writeback - end writeback against a page
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e4d3c..8436b28 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5371,6 +5371,9 @@  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		} else {
 			__init_single_pfn(pfn, zone, nid);
 		}
+#ifdef CONFIG_LOCKDEP_PAGELOCK
+		lock_page_init(pfn_to_page(pfn));
+#endif
 	}
 }