diff mbox series

[v2,RFC] mm: fs: Invalidate BH LRU during page migration

Message ID c083b0ab6e410e33ca880d639f90ef4f6f3b33ff.1613020616.git.cgoldswo@codeaurora.org (mailing list archive)
State New, archived
Headers show
Series [v2,RFC] mm: fs: Invalidate BH LRU during page migration | expand

Commit Message

Chris Goldsworthy Feb. 11, 2021, 5:35 a.m. UTC
Pages containing buffer_heads that are in one of the per-CPU
buffer_head LRU caches will be pinned and thus cannot be migrated.
This can prevent CMA allocations from succeeding, which are often used
on platforms with co-processors (such as a DSP) that can only use
physically contiguous memory. It can also prevent memory
hot-unplugging from succeeding, which involves migrating at least
MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
GiB based on the architecture in use.

Correspondingly, invalidate the BH LRU caches before a migration
starts and stop any buffer_head from being cached in the LRU caches,
until migration has finished.

Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/buffer.c                 | 54 +++++++++++++++++++++++++++++++++++++++++++--
 include/linux/buffer_head.h |  8 +++++++
 include/linux/migrate.h     |  2 ++
 mm/migrate.c                | 19 ++++++++++++++++
 mm/page_alloc.c             |  3 +++
 mm/swap.c                   |  7 +++++-
 6 files changed, 90 insertions(+), 3 deletions(-)

Comments

Matthew Wilcox (Oracle) Feb. 11, 2021, 2:09 p.m. UTC | #1
On Wed, Feb 10, 2021 at 09:35:40PM -0800, Chris Goldsworthy wrote:
> +/* These are used to control the BH LRU invalidation during page migration */
> +static struct cpumask lru_needs_invalidation;
> +static bool bh_lru_disabled = false;

As I asked before, what protects this on an SMP system?

> @@ -1292,7 +1296,9 @@ static inline void check_irqs_on(void)
>  /*
>   * Install a buffer_head into this cpu's LRU.  If not already in the LRU, it is
>   * inserted at the front, and the buffer_head at the back if any is evicted.
> - * Or, if already in the LRU it is moved to the front.
> + * Or, if already in the LRU it is moved to the front. Note that if LRU is
> + * disabled because of an ongoing page migration, we won't insert bh into the
> + * LRU.

And also, why do we need to do this?  The page LRU has no equivalent
mechanism to prevent new pages being added to the per-CPU LRU lists.
If a BH has just been used, isn't that a strong hint that this page is
a bad candidate for migration?
Chris Goldsworthy Feb. 11, 2021, 7:39 p.m. UTC | #2
On 2021-02-11 06:09, Matthew Wilcox wrote:
> On Wed, Feb 10, 2021 at 09:35:40PM -0800, Chris Goldsworthy wrote:
>> +/* These are used to control the BH LRU invalidation during page 
>> migration */
>> +static struct cpumask lru_needs_invalidation;
>> +static bool bh_lru_disabled = false;
> 
> As I asked before, what protects this on an SMP system?
> 

Sorry Matthew, I misconstrued your earlier question in V1, and thought 
you had been referring to compile-time protection (so as to prevent 
build breakages).  It is not protected, so I'll need to change this into 
an atomic counter that is incremented and decremented by bh_lru_enable() 
and bh_lru_disable() respectively (such that if the counter is greater 
than zero, we bail).

>> @@ -1292,7 +1296,9 @@ static inline void check_irqs_on(void)
>>  /*
>>   * Install a buffer_head into this cpu's LRU.  If not already in the 
>> LRU, it is
>>   * inserted at the front, and the buffer_head at the back if any is 
>> evicted.
>> - * Or, if already in the LRU it is moved to the front.
>> + * Or, if already in the LRU it is moved to the front. Note that if 
>> LRU is
>> + * disabled because of an ongoing page migration, we won't insert bh 
>> into the
>> + * LRU.
> 
> And also, why do we need to do this?  The page LRU has no equivalent
> mechanism to prevent new pages being added to the per-CPU LRU lists.
> If a BH has just been used, isn't that a strong hint that this page is
> a bad candidate for migration?

I had assumed that up until now, that pages in the page cache aren't an 
issue, such that they're dropped during migration as needed. Looking at 
try_to_free_buffers[1], I don't see any handling for the page cache.  I 
will need to do due diligence and follow up on this.

As for the question on necessity, if there is a case in which preventing 
buffer_heads from being added to the BH LRU ensures that the containing 
page can be migrated, then I would say that the change is justified, 
since adds another scenario in which migration is guaranteed (I will 
follow up on this as well).

Regards,

Chris.

[1] https://elixir.bootlin.com/linux/latest/source/fs/buffer.c#L3225
Minchan Kim Feb. 11, 2021, 10:17 p.m. UTC | #3
On Wed, Feb 10, 2021 at 09:35:40PM -0800, Chris Goldsworthy wrote:
> Pages containing buffer_heads that are in one of the per-CPU
> buffer_head LRU caches will be pinned and thus cannot be migrated.
> This can prevent CMA allocations from succeeding, which are often used
> on platforms with co-processors (such as a DSP) that can only use
> physically contiguous memory. It can also prevent memory
> hot-unplugging from succeeding, which involves migrating at least
> MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
> GiB based on the architecture in use.
> 
> Correspondingly, invalidate the BH LRU caches before a migration
> starts and stop any buffer_head from being cached in the LRU caches,
> until migration has finished.
> 
> Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/buffer.c                 | 54 +++++++++++++++++++++++++++++++++++++++++++--
>  include/linux/buffer_head.h |  8 +++++++
>  include/linux/migrate.h     |  2 ++
>  mm/migrate.c                | 19 ++++++++++++++++
>  mm/page_alloc.c             |  3 +++
>  mm/swap.c                   |  7 +++++-
>  6 files changed, 90 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 96c7604..634e474 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1274,6 +1274,10 @@ struct bh_lru {
>  
>  static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }};
>  
> +/* These are used to control the BH LRU invalidation during page migration */
> +static struct cpumask lru_needs_invalidation;
> +static bool bh_lru_disabled = false;
> +
>  #ifdef CONFIG_SMP
>  #define bh_lru_lock()	local_irq_disable()
>  #define bh_lru_unlock()	local_irq_enable()
> @@ -1292,7 +1296,9 @@ static inline void check_irqs_on(void)
>  /*
>   * Install a buffer_head into this cpu's LRU.  If not already in the LRU, it is
>   * inserted at the front, and the buffer_head at the back if any is evicted.
> - * Or, if already in the LRU it is moved to the front.
> + * Or, if already in the LRU it is moved to the front. Note that if LRU is
> + * disabled because of an ongoing page migration, we won't insert bh into the
> + * LRU.
>   */
>  static void bh_lru_install(struct buffer_head *bh)
>  {
> @@ -1303,6 +1309,9 @@ static void bh_lru_install(struct buffer_head *bh)
>  	check_irqs_on();
>  	bh_lru_lock();
>  
> +	if (bh_lru_disabled)
> +		goto out;
> +
>  	b = this_cpu_ptr(&bh_lrus);
>  	for (i = 0; i < BH_LRU_SIZE; i++) {
>  		swap(evictee, b->bhs[i]);
> @@ -1313,6 +1322,7 @@ static void bh_lru_install(struct buffer_head *bh)
>  	}
>  
>  	get_bh(bh);
> +out:
>  	bh_lru_unlock();
>  	brelse(evictee);
>  }
> @@ -1328,6 +1338,10 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
>  
>  	check_irqs_on();
>  	bh_lru_lock();
> +
> +	if (bh_lru_disabled)
> +		goto out;
> +
>  	for (i = 0; i < BH_LRU_SIZE; i++) {
>  		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
>  
> @@ -1346,6 +1360,7 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
>  			break;
>  		}
>  	}
> +out:
>  	bh_lru_unlock();
>  	return ret;
>  }
> @@ -1446,7 +1461,7 @@ EXPORT_SYMBOL(__bread_gfp);
>   * This doesn't race because it runs in each cpu either in irq
>   * or with preempt disabled.
>   */
> -static void invalidate_bh_lru(void *arg)
> +void invalidate_bh_lru(void *arg)
>  {
>  	struct bh_lru *b = &get_cpu_var(bh_lrus);
>  	int i;
> @@ -1477,6 +1492,41 @@ void invalidate_bh_lrus(void)
>  }
>  EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
>  
> +bool need_bh_lru_invalidation(int cpu)
> +{
> +	return cpumask_test_cpu(cpu, &lru_needs_invalidation);
> +}
> +
> +void bh_lru_disable(void)
> +{
> +	int cpu;
> +
> +	bh_lru_disabled = true;

What happens if the function is nested? Shouldn't we make it count?
So only disble when the count is zero.
Minchan Kim Feb. 11, 2021, 10:54 p.m. UTC | #4
On Thu, Feb 11, 2021 at 11:39:05AM -0800, Chris Goldsworthy wrote:
> On 2021-02-11 06:09, Matthew Wilcox wrote:
> > On Wed, Feb 10, 2021 at 09:35:40PM -0800, Chris Goldsworthy wrote:
> > > +/* These are used to control the BH LRU invalidation during page
> > > migration */
> > > +static struct cpumask lru_needs_invalidation;
> > > +static bool bh_lru_disabled = false;
> > 
> > As I asked before, what protects this on an SMP system?
> > 
> 
> Sorry Matthew, I misconstrued your earlier question in V1, and thought you
> had been referring to compile-time protection (so as to prevent build
> breakages).  It is not protected, so I'll need to change this into an atomic
> counter that is incremented and decremented by bh_lru_enable() and
> bh_lru_disable() respectively (such that if the counter is greater than
> zero, we bail).
> 
> > > @@ -1292,7 +1296,9 @@ static inline void check_irqs_on(void)
> > >  /*
> > >   * Install a buffer_head into this cpu's LRU.  If not already in
> > > the LRU, it is
> > >   * inserted at the front, and the buffer_head at the back if any is
> > > evicted.
> > > - * Or, if already in the LRU it is moved to the front.
> > > + * Or, if already in the LRU it is moved to the front. Note that if
> > > LRU is
> > > + * disabled because of an ongoing page migration, we won't insert
> > > bh into the
> > > + * LRU.
> > 
> > And also, why do we need to do this?  The page LRU has no equivalent
> > mechanism to prevent new pages being added to the per-CPU LRU lists.
> > If a BH has just been used, isn't that a strong hint that this page is
> > a bad candidate for migration?
> 
> I had assumed that up until now, that pages in the page cache aren't an
> issue, such that they're dropped during migration as needed. Looking at
> try_to_free_buffers[1], I don't see any handling for the page cache.  I will
> need to do due diligence and follow up on this.
> 
> As for the question on necessity, if there is a case in which preventing
> buffer_heads from being added to the BH LRU ensures that the containing page
> can be migrated, then I would say that the change is justified, since adds
> another scenario in which migration is guaranteed (I will follow up on this
> as well).



First of all, Thanks for the work, Chris.

Looks like this is not only bh_lru problem but also general problem for
LRU pagevecs. Furthemore, there are other hidden cache meachnism to hold
additional page refcount until they are flush.
(I have seen pages in pagevec with additional refcount on migration
could make migration failure since early LRU draining right before
migrate_pages). Even though migrate_pages has retrial logic, it just
relies on the luck so the CMA allocation is still fragile for failure.

Ccing more folks, a random thought.
The idea is disable such cache mechanism for a while critical migration(
e.g., CMA) is going on. With the migrate_pending, we could apply draining
whenever we find additional refcount problem.

diff --git a/fs/buffer.c b/fs/buffer.c
index 96c7604f69b3..17b8c1efdbf3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -48,6 +48,7 @@
 #include <linux/sched/mm.h>
 #include <trace/events/block.h>
 #include <linux/fscrypt.h>
+#include <linux/migrate.h>
 
 #include "internal.h"
 
@@ -1300,6 +1301,9 @@ static void bh_lru_install(struct buffer_head *bh)
 	struct bh_lru *b;
 	int i;
 
+	if (migrate_pending())
+		return;
+
 	check_irqs_on();
 	bh_lru_lock();
 
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a389633b68f..047d5358fe0d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -46,6 +46,8 @@ extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
 
 extern void migrate_prep(void);
+extern void migrate_finish(void);
+extern bool migrate_pending(void);
 extern void migrate_prep_local(void);
 extern void migrate_page_states(struct page *newpage, struct page *page);
 extern void migrate_page_copy(struct page *newpage, struct page *page);
@@ -67,6 +69,7 @@ static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
 	{ return -EBUSY; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
+static inline void migrate_finish(void) {}
 static inline int migrate_prep_local(void) { return -ENOSYS; }
 
 static inline void migrate_page_states(struct page *newpage, struct page *page)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6961238c7ef5..46d9986c7bf0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1208,6 +1208,8 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 			break;
 	}
 	mmap_read_unlock(mm);
+	migrate_finish();
+
 	if (err < 0)
 		return err;
 	return busy;
@@ -1371,6 +1373,10 @@ static long do_mbind(unsigned long start, unsigned long len,
 	mmap_write_unlock(mm);
 mpol_out:
 	mpol_put(new);
+
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_finish();
+
 	return err;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index a69da8aaeccd..3130a27e4e94 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,6 +57,8 @@
 
 #include "internal.h"
 
+static atomic_t migrate_pending_count = ATOMIC_INIT(0);
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -64,13 +66,12 @@
  */
 void migrate_prep(void)
 {
+	atomic_inc(&migrate_pending_count);
 	/*
 	 * Clear the LRU lists so pages can be isolated.
-	 * Note that pages may be moved off the LRU after we have
-	 * drained them. Those pages will fail to migrate like other
-	 * pages that may be busy.
 	 */
 	lru_add_drain_all();
+	invalidate_bh_lrus();
 }
 
 /* Do the necessary work of migrate_prep but not if it involves other CPUs */
@@ -79,6 +80,16 @@ void migrate_prep_local(void)
 	lru_add_drain();
 }
 
+void migrate_finish(void)
+{
+	atomic_dec(&migrate_pending_count);
+}
+
+bool migrate_pending(void)
+{
+	return atomic_read(&migrate_pending_count);
+}
+
 int isolate_movable_page(struct page *page, isolate_mode_t mode)
 {
 	struct address_space *mapping;
@@ -1837,6 +1848,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 	if (err >= 0)
 		err = err1;
 out:
+	migrate_finish();
 	return err;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8fbee73dd6d..4ced6d559073 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8493,6 +8493,9 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
 				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
 	}
+
+	migrate_finish();
+
 	if (ret < 0) {
 		putback_movable_pages(&cc->migratepages);
 		return ret;
diff --git a/mm/swap.c b/mm/swap.c
index 31b844d4ed94..55f9e8c8ca5b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -36,6 +36,7 @@
 #include <linux/hugetlb.h>
 #include <linux/page_idle.h>
 #include <linux/local_lock.h>
+#include <linux/migrate.h>
 
 #include "internal.h"
 
@@ -252,7 +253,8 @@ void rotate_reclaimable_page(struct page *page)
 		get_page(page);
 		local_lock_irqsave(&lru_rotate.lock, flags);
 		pvec = this_cpu_ptr(&lru_rotate.pvec);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (!pagevec_add(pvec, page) || PageCompound(page)
+					|| migrate_pending())
 			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
@@ -343,7 +345,8 @@ static void activate_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.activate_page);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (!pagevec_add(pvec, page) || PageCompound(page)
+					|| migrate_pending())
 			pagevec_lru_move_fn(pvec, __activate_page);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -458,7 +461,7 @@ void lru_cache_add(struct page *page)
 	get_page(page);
 	local_lock(&lru_pvecs.lock);
 	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
-	if (!pagevec_add(pvec, page) || PageCompound(page))
+	if (!pagevec_add(pvec, page) || PageCompound(page) || migrate_pending())
 		__pagevec_lru_add(pvec);
 	local_unlock(&lru_pvecs.lock);
 }
@@ -654,7 +657,8 @@ void deactivate_file_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
 
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (!pagevec_add(pvec, page) || PageCompound(page) ||
+					migrate_pending())
 			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -676,7 +680,8 @@ void deactivate_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (!pagevec_add(pvec, page) || PageCompound(page) ||
+					migrate_pending())
 			pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -698,7 +703,8 @@ void mark_page_lazyfree(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (!pagevec_add(pvec, page) || PageCompound(page)
+					|| migrate_pending())
 			pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
diff mbox series

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index 96c7604..634e474 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1274,6 +1274,10 @@  struct bh_lru {
 
 static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }};
 
+/* These are used to control the BH LRU invalidation during page migration */
+static struct cpumask lru_needs_invalidation;
+static bool bh_lru_disabled = false;
+
 #ifdef CONFIG_SMP
 #define bh_lru_lock()	local_irq_disable()
 #define bh_lru_unlock()	local_irq_enable()
@@ -1292,7 +1296,9 @@  static inline void check_irqs_on(void)
 /*
  * Install a buffer_head into this cpu's LRU.  If not already in the LRU, it is
  * inserted at the front, and the buffer_head at the back if any is evicted.
- * Or, if already in the LRU it is moved to the front.
+ * Or, if already in the LRU it is moved to the front. Note that if LRU is
+ * disabled because of an ongoing page migration, we won't insert bh into the
+ * LRU.
  */
 static void bh_lru_install(struct buffer_head *bh)
 {
@@ -1303,6 +1309,9 @@  static void bh_lru_install(struct buffer_head *bh)
 	check_irqs_on();
 	bh_lru_lock();
 
+	if (bh_lru_disabled)
+		goto out;
+
 	b = this_cpu_ptr(&bh_lrus);
 	for (i = 0; i < BH_LRU_SIZE; i++) {
 		swap(evictee, b->bhs[i]);
@@ -1313,6 +1322,7 @@  static void bh_lru_install(struct buffer_head *bh)
 	}
 
 	get_bh(bh);
+out:
 	bh_lru_unlock();
 	brelse(evictee);
 }
@@ -1328,6 +1338,10 @@  lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
 
 	check_irqs_on();
 	bh_lru_lock();
+
+	if (bh_lru_disabled)
+		goto out;
+
 	for (i = 0; i < BH_LRU_SIZE; i++) {
 		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
 
@@ -1346,6 +1360,7 @@  lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
 			break;
 		}
 	}
+out:
 	bh_lru_unlock();
 	return ret;
 }
@@ -1446,7 +1461,7 @@  EXPORT_SYMBOL(__bread_gfp);
  * This doesn't race because it runs in each cpu either in irq
  * or with preempt disabled.
  */
-static void invalidate_bh_lru(void *arg)
+void invalidate_bh_lru(void *arg)
 {
 	struct bh_lru *b = &get_cpu_var(bh_lrus);
 	int i;
@@ -1477,6 +1492,41 @@  void invalidate_bh_lrus(void)
 }
 EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
 
+bool need_bh_lru_invalidation(int cpu)
+{
+	return cpumask_test_cpu(cpu, &lru_needs_invalidation);
+}
+
+void bh_lru_disable(void)
+{
+	int cpu;
+
+	bh_lru_disabled = true;
+
+	/*
+	 * This barrier ensures that invocations of bh_lru_install()
+	 * after this barrier see that bh_lru_disabled == true (until
+	 * bh_lru_enable() is eventually called)..
+	 */
+	smp_mb();
+
+	/*
+	 * It's alright if someone comes along and hot-plugs a new CPU,
+	 * since we have that bh_lru_dsiabled == true.  The hot-remove
+	 * case is handled in buffer_exit_cpu_dead().
+	 */
+	for_each_online_cpu(cpu) {
+		if (has_bh_in_lru(cpu, NULL))
+			cpumask_set_cpu(cpu, &lru_needs_invalidation);
+	}
+}
+
+void bh_lru_enable(void)
+{
+	bh_lru_disabled = false;
+	cpumask_clear(&lru_needs_invalidation);
+}
+
 void set_bh_page(struct buffer_head *bh,
 		struct page *page, unsigned long offset)
 {
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 6b47f94..78eb5ee 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -193,7 +193,11 @@  void __breadahead_gfp(struct block_device *, sector_t block, unsigned int size,
 		  gfp_t gfp);
 struct buffer_head *__bread_gfp(struct block_device *,
 				sector_t block, unsigned size, gfp_t gfp);
+void invalidate_bh_lru(void *arg);
 void invalidate_bh_lrus(void);
+bool need_bh_lru_invalidation(int cpu);
+void bh_lru_disable(void);
+void bh_lru_enable(void);
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);
 void free_buffer_head(struct buffer_head * bh);
 void unlock_buffer(struct buffer_head *bh);
@@ -401,6 +405,10 @@  extern int __set_page_dirty_buffers(struct page *page);
 #else /* CONFIG_BLOCK */
 
 static inline void buffer_init(void) {}
+static inline void invalidate_bh_lru(void) {}
+static inline bool need_bh_lru_invalidation(int cpu) { return false; }
+static inline void bh_lru_disable(void) {}
+static inline void bh_lru_enable(void) {}
 static inline int try_to_free_buffers(struct page *page) { return 1; }
 static inline int inode_has_buffers(struct inode *inode) { return 0; }
 static inline void invalidate_inode_buffers(struct inode *inode) {}
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a38963..9e4a2dc 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -46,6 +46,7 @@  extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
 
 extern void migrate_prep(void);
+extern void migrate_finish(void);
 extern void migrate_prep_local(void);
 extern void migrate_page_states(struct page *newpage, struct page *page);
 extern void migrate_page_copy(struct page *newpage, struct page *page);
@@ -67,6 +68,7 @@  static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
 	{ return -EBUSY; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
+static inline int migrate_finish(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
 
 static inline void migrate_page_states(struct page *newpage, struct page *page)
diff --git a/mm/migrate.c b/mm/migrate.c
index a69da8a..a8928ee7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -65,6 +65,16 @@ 
 void migrate_prep(void)
 {
 	/*
+	 * If a page has buffer_heads contained in one of the per-cpu
+	 * BH LRU caches, that page can't be migrated.  Accordingly, we
+	 * call bh_lru_disable() to prevent further buffer_heads from
+	 * being cached, before we invalidate the LRUs in
+	 * lru_add_drain_all().  The LRUs are re-enabled in
+	 * migrate_finish().
+	 */
+	bh_lru_disable();
+
+	/*
 	 * Clear the LRU lists so pages can be isolated.
 	 * Note that pages may be moved off the LRU after we have
 	 * drained them. Those pages will fail to migrate like other
@@ -73,6 +83,15 @@  void migrate_prep(void)
 	lru_add_drain_all();
 }
 
+void migrate_finish(void)
+{
+	/*
+	 * Renable the per-cpu BH LRU caches, after having disabled them
+	 * in migrate_prep().
+	 */
+	bh_lru_enable();
+}
+
 /* Do the necessary work of migrate_prep but not if it involves other CPUs */
 void migrate_prep_local(void)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6446778..e4cb959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8493,6 +8493,9 @@  static int __alloc_contig_migrate_range(struct compact_control *cc,
 		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
 				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
 	}
+
+	migrate_finish();
+
 	if (ret < 0) {
 		putback_movable_pages(&cc->migratepages);
 		return ret;
diff --git a/mm/swap.c b/mm/swap.c
index 31b844d..c733c95 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -36,6 +36,7 @@ 
 #include <linux/hugetlb.h>
 #include <linux/page_idle.h>
 #include <linux/local_lock.h>
+#include <linux/buffer_head.h>
 
 #include "internal.h"
 
@@ -628,6 +629,9 @@  void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 
+	if (need_bh_lru_invalidation(cpu))
+		invalidate_bh_lru(NULL);
+
 	activate_page_drain(cpu);
 }
 
@@ -815,7 +819,8 @@  void lru_add_drain_all(void)
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		    need_activate_page_drain(cpu) ||
+		    need_bh_lru_invalidation(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, mm_percpu_wq, work);
 			__cpumask_set_cpu(cpu, &has_work);