[v2,10/13] mm: memcontrol: use obj_cgroup APIs to charge the LRU pages

Message ID	20210916134748.67712-11-songmuchun@bytedance.com (mailing list archive)
State	New
Headers
Return-Path: <SRS0=V4uO=OG=kvack.org=owner-linux-mm@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org DED0160EB4
From: Muchun Song <songmuchun@bytedance.com>
To: guro@fb.com, hannes@cmpxchg.org, mhocko@kernel.org, akpm@linux-foundation.org, shakeelb@google.com, vdavydov.dev@gmail.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, fam.zheng@bytedance.com, bsingharora@gmail.com, shy828301@gmail.com, alexs@kernel.org, smuchun@gmail.com, zhengqi.arch@bytedance.com, Muchun Song <songmuchun@bytedance.com>
Subject: [PATCH v2 10/13] mm: memcontrol: use obj_cgroup APIs to charge the LRU pages
Date: Thu, 16 Sep 2021 21:47:45 +0800
Message-Id: <20210916134748.67712-11-songmuchun@bytedance.com>
In-Reply-To: <20210916134748.67712-1-songmuchun@bytedance.com>
References: <20210916134748.67712-1-songmuchun@bytedance.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
Series	Use obj_cgroup APIs to charge the LRU pages
[v2,00/13] Use obj_cgroup APIs to charge the LRU pages
[v2,01/13] mm: move mem_cgroup_kmem_disabled() to memcontrol.h
[v2,02/13] mm: memcontrol: prepare objcg API for non-kmem usage
[v2,03/13] mm: memcontrol: introduce compact_lock_page_irqsave
[v2,04/13] mm: memcontrol: make lruvec lock safe when the LRU pages reparented
[v2,05/13] mm: vmscan: rework move_pages_to_lru()
[v2,06/13] mm: thp: introduce split_queue_lock/unlock{_irqsave}()
[v2,07/13] mm: thp: make split queue lock safe when LRU pages reparented
[v2,08/13] mm: memcontrol: make all the callers of page_memcg() safe
[v2,09/13] mm: memcontrol: introduce memcg_reparent_ops
[v2,10/13] mm: memcontrol: use obj_cgroup APIs to charge the LRU pages
[v2,11/13] mm: memcontrol: rename {un}lock_page_memcg() to {un}lock_page_objcg()
[v2,12/13] mm: lru: add VM_BUG_ON_PAGE to lru maintenance function
[v2,13/13] mm: lru: use lruvec lock to serialize memcg changes

Hi Muchun,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on linus/master v5.15-rc1]
[cannot apply to hnaz-mm/master next-20210917]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Muchun-Song/Use-obj_cgroup-APIs-to-charge-the-LRU-pages/20210916-215452
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
config: i386-randconfig-s002-20210916 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/a91949623178a5b2ef32a0842f6b6bae02a1a696
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Muchun-Song/Use-obj_cgroup-APIs-to-charge-the-LRU-pages/20210916-215452
        git checkout a91949623178a5b2ef32a0842f6b6bae02a1a696
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
   mm/memcontrol.c:4227:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4227:21: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4227:21: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4229:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4229:21: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4229:21: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4385:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4385:9: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4385:9: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4479:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4479:9: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4479:9: sparse:    struct mem_cgroup_threshold_ary *
>> mm/memcontrol.c:5832:26: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct obj_cgroup *objcg @@     got struct obj_cgroup [noderef] __rcu *objcg @@
   mm/memcontrol.c:5833:28: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct obj_cgroup *objcg @@     got struct obj_cgroup [noderef] __rcu *objcg @@
   mm/memcontrol.c:6128:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:6128:23: sparse:    struct task_struct [noderef] __rcu *
   mm/memcontrol.c:6128:23: sparse:    struct task_struct *
   mm/memcontrol.c: note: in included file:
   include/linux/memcontrol.h:760:9: sparse: sparse: context imbalance in 'memcg_reparent_lruvec_lock' - wrong count at exit
   include/linux/memcontrol.h:760:9: sparse: sparse: context imbalance in 'memcg_reparent_lruvec_unlock' - unexpected unlock
   mm/memcontrol.c: note: in included file (through include/linux/rculist.h, include/linux/pid.h, include/linux/sched.h, ...):
   include/linux/rcupdate.h:718:9: sparse: sparse: context imbalance in 'lock_page_lruvec' - wrong count at exit
   include/linux/rcupdate.h:718:9: sparse: sparse: context imbalance in 'lock_page_lruvec_irq' - wrong count at exit
   include/linux/rcupdate.h:718:9: sparse: sparse: context imbalance in 'lock_page_lruvec_irqsave' - wrong count at exit
   mm/memcontrol.c:2109:6: sparse: sparse: context imbalance in 'lock_page_memcg' - wrong count at exit
   mm/memcontrol.c:2160:17: sparse: sparse: context imbalance in '__unlock_page_memcg' - unexpected unlock

vim +5832 mm/memcontrol.c

  5730	
  5731	/**
  5732	 * mem_cgroup_move_account - move account of the page
  5733	 * @page: the page
  5734	 * @compound: charge the page as compound or small page
  5735	 * @from: mem_cgroup which the page is moved from.
  5736	 * @to: mem_cgroup which the page is moved to. @from != @to.
  5737	 *
  5738	 * The caller must make sure the page is not on LRU (isolate_page() is useful.)
  5739	 *
  5740	 * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
  5741	 * from old cgroup.
  5742	 */
  5743	static int mem_cgroup_move_account(struct page *page,
  5744					   bool compound,
  5745					   struct mem_cgroup *from,
  5746					   struct mem_cgroup *to)
  5747	{
  5748		struct lruvec *from_vec, *to_vec;
  5749		struct pglist_data *pgdat;
  5750		unsigned int nr_pages = compound ? thp_nr_pages(page) : 1;
  5751		int ret;
  5752	
  5753		VM_BUG_ON(from == to);
  5754		VM_BUG_ON_PAGE(PageLRU(page), page);
  5755		VM_BUG_ON(compound && !PageTransHuge(page));
  5756	
  5757		/*
  5758		 * Prevent mem_cgroup_migrate() from looking at
  5759		 * page's memory cgroup of its source page while we change it.
  5760		 */
  5761		ret = -EBUSY;
  5762		if (!trylock_page(page))
  5763			goto out;
  5764	
  5765		ret = -EINVAL;
  5766		if (page_memcg(page) != from)
  5767			goto out_unlock;
  5768	
  5769		pgdat = page_pgdat(page);
  5770		from_vec = mem_cgroup_lruvec(from, pgdat);
  5771		to_vec = mem_cgroup_lruvec(to, pgdat);
  5772	
  5773		lock_page_memcg(page);
  5774	
  5775		if (PageAnon(page)) {
  5776			if (page_mapped(page)) {
  5777				__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
  5778				__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
  5779				if (PageTransHuge(page)) {
  5780					__mod_lruvec_state(from_vec, NR_ANON_THPS,
  5781							   -nr_pages);
  5782					__mod_lruvec_state(to_vec, NR_ANON_THPS,
  5783							   nr_pages);
  5784				}
  5785			}
  5786		} else {
  5787			__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
  5788			__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
  5789	
  5790			if (PageSwapBacked(page)) {
  5791				__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
  5792				__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
  5793			}
  5794	
  5795			if (page_mapped(page)) {
  5796				__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
  5797				__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
  5798			}
  5799	
  5800			if (PageDirty(page)) {
  5801				struct address_space *mapping = page_mapping(page);
  5802	
  5803				if (mapping_can_writeback(mapping)) {
  5804					__mod_lruvec_state(from_vec, NR_FILE_DIRTY,
  5805							   -nr_pages);
  5806					__mod_lruvec_state(to_vec, NR_FILE_DIRTY,
  5807							   nr_pages);
  5808				}
  5809			}
  5810		}
  5811	
  5812		if (PageWriteback(page)) {
  5813			__mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages);
  5814			__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
  5815		}
  5816	
  5817		/*
  5818		 * All state has been migrated, let's switch to the new memcg.
  5819		 *
  5820		 * It is safe to change page's memcg here because the page
  5821		 * is referenced, charged, isolated, and locked: we can't race
  5822		 * with (un)charging, migration, LRU putback, or anything else
  5823		 * that would rely on a stable page's memory cgroup.
  5824		 *
  5825		 * Note that lock_page_memcg is a memcg lock, not a page lock,
  5826		 * to save space. As soon as we switch page's memory cgroup to a
  5827		 * new memcg that isn't locked, the above state can change
  5828		 * concurrently again. Make sure we're truly done with it.
  5829		 */
  5830		smp_mb();
  5831	
> 5832		obj_cgroup_get(to->objcg);
  5833		obj_cgroup_put(from->objcg);
  5834	
  5835		page->memcg_data = (unsigned long)to->objcg;
  5836	
  5837		__unlock_page_memcg(from);
  5838	
  5839		ret = 0;
  5840	
  5841		local_irq_disable();
  5842		mem_cgroup_charge_statistics(to, page, nr_pages);
  5843		memcg_check_events(to, page);
  5844		mem_cgroup_charge_statistics(from, page, -nr_pages);
  5845		memcg_check_events(from, page);
  5846		local_irq_enable();
  5847	out_unlock:
  5848		unlock_page(page);
  5849	out:
  5850		return ret;
  5851	}
  5852	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
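For readers unfamiliar with this class of sparse warning: the new report at mm/memcontrol.c:5832 says that to->objcg carries the __rcu annotation while obj_cgroup_get() expects a plain pointer. The userspace sketch below only illustrates that mechanism. All demo_* names are hypothetical, the macro definitions are simplified stand-ins for the kernel's, and the cast-based accessor is an assumption about one way to cross the annotation boundary, not the fix chosen for this series.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-ins for the kernel annotations. Under sparse
 * (__CHECKER__ defined) the pointer lives in a separate address space;
 * under a normal compiler the macros expand to nothing.
 */
#ifdef __CHECKER__
# define __rcu   __attribute__((noderef, address_space(__rcu)))
# define __force __attribute__((force))
#else
# define __rcu
# define __force
#endif

/* Hypothetical miniature of an object cgroup: just a reference count. */
struct demo_objcg {
	long refcnt;
};

/* Hypothetical miniature of a memcg holding an annotated objcg pointer. */
struct demo_memcg {
	struct demo_objcg __rcu *objcg;
};

/* Both helpers take a plain pointer, like the functions in the report. */
static void demo_objcg_get(struct demo_objcg *objcg)
{
	objcg->refcnt++;
}

static void demo_objcg_put(struct demo_objcg *objcg)
{
	assert(objcg->refcnt > 0);
	objcg->refcnt--;
}

/*
 * Hypothetical accessor: strips the annotation in one audited place so
 * callers never pass the annotated pointer directly. This sketch does
 * no real RCU synchronization.
 */
static struct demo_objcg *demo_objcg_deref(struct demo_memcg *memcg)
{
	return (struct demo_objcg __force *)memcg->objcg;
}

/* Mirrors the get/put pair at lines 5832-5833, going through the accessor. */
static void demo_move_ref(struct demo_memcg *from, struct demo_memcg *to)
{
	demo_objcg_get(demo_objcg_deref(to));
	demo_objcg_put(demo_objcg_deref(from));
}
```

In kernel code the same boundary is normally crossed with one of the rcu_dereference*() accessors; which variant is appropriate here depends on the locking context at the call site, which the report does not establish.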

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18344c1f4333..3d9691395cf3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -376,8 +376,6 @@ enum page_memcg_data_flags {
 
 #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
 
-static inline bool PageMemcgKmem(struct page *page);
-
 /*
  * After the initialization objcg->memcg is always pointing at
  * a valid memcg, but can be atomically swapped to the parent memcg.
@@ -391,43 +389,19 @@ static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
 }
 
 /*
- * __page_memcg - get the memory cgroup associated with a non-kmem page
- * @page: a pointer to the page struct
- *
- * Returns a pointer to the memory cgroup associated with the page,
- * or NULL. This function assumes that the page is known to have a
- * proper memory cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages or
- * kmem pages.
- */
-static inline struct mem_cgroup *__page_memcg(struct page *page)
-{
-	unsigned long memcg_data = page->memcg_data;
-
-	VM_BUG_ON_PAGE(PageSlab(page), page);
-	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
-	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
-
-	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
-}
-
-/*
- * __page_objcg - get the object cgroup associated with a kmem page
+ * page_objcg - get the object cgroup associated with page
  * @page: a pointer to the page struct
  *
  * Returns a pointer to the object cgroup associated with the page,
  * or NULL. This function assumes that the page is known to have a
- * proper object cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages or
- * LRU pages.
+ * proper object cgroup pointer.
  */
-static inline struct obj_cgroup *__page_objcg(struct page *page)
+static inline struct obj_cgroup *page_objcg(struct page *page)
 {
 	unsigned long memcg_data = page->memcg_data;
 
 	VM_BUG_ON_PAGE(PageSlab(page), page);
 	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
-	VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page);
 
 	return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }
@@ -441,23 +415,35 @@ static inline struct obj_cgroup *__page_objcg(struct page *page)
  * proper memory cgroup pointer. It's not safe to call this function
  * against some type of pages, e.g. slab pages or ex-slab pages.
  *
- * For a non-kmem page any of the following ensures page and memcg binding
- * stability:
+ * For a page any of the following ensures page and objcg binding stability:
  *
  * - the page lock
  * - LRU isolation
  * - lock_page_memcg()
  * - exclusive reference
  *
- * For a kmem page a caller should hold an rcu read lock to protect memcg
- * associated with a kmem page from being released.
+ * Based on the stable binding of page and objcg, for a page any of the
+ * following ensures page and memcg binding stability:
+ *
+ * - css_set_lock
+ * - cgroup_mutex
+ * - the lruvec lock
+ * - the split queue lock (only THP page)
+ *
+ * If the caller only want to ensure that the page counters of memcg are
+ * updated correctly, ensure that the binding stability of page and objcg
+ * is sufficient.
+ *
+ * A caller should hold an rcu read lock (In addition, regions of code across
+ * which interrupts, preemption, or softirqs have been disabled also serve as
+ * RCU read-side critical sections) to protect memcg associated with a page
+ * from being released.
  */
 static inline struct mem_cgroup *page_memcg(struct page *page)
 {
-	if (PageMemcgKmem(page))
-		return obj_cgroup_memcg(__page_objcg(page));
-	else
-		return __page_memcg(page);
+	struct obj_cgroup *objcg = page_objcg(page);
+
+	return objcg ? obj_cgroup_memcg(objcg) : NULL;
 }
 
 /*
@@ -470,6 +456,8 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
  * is known to have a proper memory cgroup pointer. It's not safe to call
  * this function against some type of pages, e.g. slab pages or ex-slab
  * pages.
+ *
+ * The page and objcg or memcg binding rules can refer to page_memcg().
  */
 static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
 {
@@ -493,22 +481,20 @@ static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
  * or NULL. This function assumes that the page is known to have a
  * proper memory cgroup pointer. It's not safe to call this function
  * against some type of pages, e.g. slab pages or ex-slab pages.
+ *
+ * The page and objcg or memcg binding rules can refer to page_memcg().
  */
 static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
 {
 	unsigned long memcg_data = READ_ONCE(page->memcg_data);
+	struct obj_cgroup *objcg;
 
 	VM_BUG_ON_PAGE(PageSlab(page), page);
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
-	if (memcg_data & MEMCG_DATA_KMEM) {
-		struct obj_cgroup *objcg;
-
-		objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
-		return obj_cgroup_memcg(objcg);
-	}
+	objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 
-	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+	return objcg ? obj_cgroup_memcg(objcg) : NULL;
 }
 
 /*
@@ -521,16 +507,10 @@ static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
  * has an associated memory cgroup pointer or an object cgroups vector or
  * an object cgroup.
  *
- * For a non-kmem page any of the following ensures page and memcg binding
- * stability:
- *
- * - the page lock
- * - LRU isolation
- * - lock_page_memcg()
- * - exclusive reference
+ * The page and objcg or memcg binding rules can refer to page_memcg().
  *
- * For a kmem page a caller should hold an rcu read lock to protect memcg
- * associated with a kmem page from being released.
+ * A caller should hold an rcu read lock to protect memcg associated with a
+ * page from being released.
  */
 static inline struct mem_cgroup *page_memcg_check(struct page *page)
 {
@@ -539,18 +519,14 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
 	 * for slab pages, READ_ONCE() should be used here.
 	 */
 	unsigned long memcg_data = READ_ONCE(page->memcg_data);
+	struct obj_cgroup *objcg;
 
 	if (memcg_data & MEMCG_DATA_OBJCGS)
 		return NULL;
 
-	if (memcg_data & MEMCG_DATA_KMEM) {
-		struct obj_cgroup *objcg;
-
-		objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
-		return obj_cgroup_memcg(objcg);
-	}
+	objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 
-	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+	return objcg ? obj_cgroup_memcg(objcg) : NULL;
 }
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 12950d4988e6..d6738637feae 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -499,6 +499,48 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_MEMCG
+static struct shrinker deferred_split_shrinker;
+
+static void memcg_reparent_split_queue_lock(struct mem_cgroup *memcg,
+					    struct mem_cgroup *parent)
+{
+	spin_lock(&memcg->deferred_split_queue.split_queue_lock);
+	spin_lock(&parent->deferred_split_queue.split_queue_lock);
+}
+
+static void memcg_reparent_split_queue_unlock(struct mem_cgroup *memcg,
+					      struct mem_cgroup *parent)
+{
+	spin_unlock(&parent->deferred_split_queue.split_queue_lock);
+	spin_unlock(&memcg->deferred_split_queue.split_queue_lock);
+}
+
+static void memcg_reparent_split_queue(struct mem_cgroup *memcg,
+				       struct mem_cgroup *parent)
+{
+	int nid;
+	struct deferred_split *src, *dst;
+
+	src = &memcg->deferred_split_queue;
+	dst = &parent->deferred_split_queue;
+
+	if (!src->split_queue_len)
+		return;
+
+	list_splice_tail_init(&src->split_queue, &dst->split_queue);
+	dst->split_queue_len += src->split_queue_len;
+	src->split_queue_len = 0;
+
+	for_each_node(nid)
+		set_shrinker_bit(parent, nid, deferred_split_shrinker.id);
+}
+
+const struct memcg_reparent_ops split_queue_reparent_ops = {
+	.lock		= memcg_reparent_split_queue_lock,
+	.unlock		= memcg_reparent_split_queue_unlock,
+	.reparent	= memcg_reparent_split_queue,
+};
+
 static inline struct mem_cgroup *split_queue_memcg(struct deferred_split *queue)
 {
 	if (mem_cgroup_disabled())
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a73fd192734..3688651d85c2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -75,6 +75,7 @@ struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
 
 struct mem_cgroup *root_mem_cgroup __read_mostly;
+static struct obj_cgroup *root_obj_cgroup __read_mostly;
 
 /* Active memory cgroup to use from an interrupt context */
 DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
@@ -261,6 +262,11 @@ struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr)
 	return container_of(vmpr, struct mem_cgroup, vmpressure);
 }
 
+static inline bool obj_cgroup_is_root(struct obj_cgroup *objcg)
+{
+	return objcg == root_obj_cgroup;
+}
+
 extern spinlock_t css_set_lock;
 
 static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
@@ -333,7 +339,81 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
 	return objcg;
 }
 
-static const struct memcg_reparent_ops *memcg_reparent_ops[] = {};
+static void memcg_reparent_lruvec_lock(struct mem_cgroup *memcg,
+				       struct mem_cgroup *parent)
+{
+	int i;
+
+	for_each_node(i) {
+		spin_lock(&mem_cgroup_lruvec(memcg, NODE_DATA(i))->lru_lock);
+		spin_lock(&mem_cgroup_lruvec(parent, NODE_DATA(i))->lru_lock);
+	}
+}
+
+static void memcg_reparent_lruvec_unlock(struct mem_cgroup *memcg,
+					 struct mem_cgroup *parent)
+{
+	int i;
+
+	for_each_node(i) {
+		spin_unlock(&mem_cgroup_lruvec(parent, NODE_DATA(i))->lru_lock);
+		spin_unlock(&mem_cgroup_lruvec(memcg, NODE_DATA(i))->lru_lock);
+	}
+}
+
+static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
+				enum lru_list lru)
+{
+	int zid;
+	struct mem_cgroup_per_node *mz_src, *mz_dst;
+
+	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
+	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
+
+	list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
+		mz_src->lru_zone_size[zid][lru] = 0;
+	}
+}
+
+static void memcg_reparent_lruvec(struct mem_cgroup *memcg,
+				  struct mem_cgroup *parent)
+{
+	int i;
+
+	for_each_node(i) {
+		enum lru_list lru;
+		struct lruvec *src, *dst;
+
+		src = mem_cgroup_lruvec(memcg, NODE_DATA(i));
+		dst = mem_cgroup_lruvec(parent, NODE_DATA(i));
+
+		dst->anon_cost += src->anon_cost;
+		dst->file_cost += src->file_cost;
+
+		for_each_lru(lru)
+			lruvec_reparent_lru(src, dst, lru);
+	}
+}
+
+static const struct memcg_reparent_ops lruvec_reparent_ops = {
+	.lock		= memcg_reparent_lruvec_lock,
+	.unlock		= memcg_reparent_lruvec_unlock,
+	.reparent	= memcg_reparent_lruvec,
+};
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern struct memcg_reparent_ops split_queue_reparent_ops;
+#endif
+
+static const struct memcg_reparent_ops *memcg_reparent_ops[] = {
+	&lruvec_reparent_ops,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	&split_queue_reparent_ops,
+#endif
+};
 
 static void memcg_reparent_lock(struct mem_cgroup *memcg,
 				struct mem_cgroup *parent)
@@ -2797,18 +2877,18 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 }
 #endif
 
-static void commit_charge(struct page *page, struct mem_cgroup *memcg)
+static void commit_charge(struct page *page, struct obj_cgroup *objcg)
 {
-	VM_BUG_ON_PAGE(page_memcg(page), page);
+	VM_BUG_ON_PAGE(page_objcg(page), page);
 	/*
-	 * Any of the following ensures page's memcg stability:
+	 * Any of the following ensures page's objcg stability:
 	 *
 	 * - the page lock
 	 * - LRU isolation
 	 * - lock_page_memcg()
 	 * - exclusive reference
 	 */
-	page->memcg_data = (unsigned long)memcg;
+	page->memcg_data = (unsigned long)objcg;
 }
 
 static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
@@ -2825,6 +2905,21 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
 	return memcg;
 }
 
+static struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+{
+	struct obj_cgroup *objcg = NULL;
+
+	rcu_read_lock();
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		objcg = rcu_dereference(memcg->objcg);
+		if (objcg && obj_cgroup_tryget(objcg))
+			break;
+	}
+	rcu_read_unlock();
+
+	return objcg;
+}
+
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * The allocated objcg pointers array is not accounted directly.
@@ -2930,12 +3025,15 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 	else
 		memcg = mem_cgroup_from_task(current);
 
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
-		objcg = rcu_dereference(memcg->objcg);
-		if (objcg && obj_cgroup_tryget(objcg))
-			break;
+	if (mem_cgroup_is_root(memcg))
+		goto out;
+
+	objcg = get_obj_cgroup_from_memcg(memcg);
+	if (obj_cgroup_is_root(objcg)) {
+		obj_cgroup_put(objcg);
 		objcg = NULL;
 	}
+out:
 	rcu_read_unlock();
 
 	return objcg;
@@ -3078,13 +3176,13 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
  */
 void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
-	struct obj_cgroup *objcg;
+	struct obj_cgroup *objcg = page_objcg(page);
 	unsigned int nr_pages = 1 << order;
 
-	if (!PageMemcgKmem(page))
+	if (!objcg)
 		return;
 
-	objcg = __page_objcg(page);
+	VM_BUG_ON_PAGE(!PageMemcgKmem(page), page);
 	obj_cgroup_uncharge_pages(objcg, nr_pages);
 	page->memcg_data = 0;
 	obj_cgroup_put(objcg);
@@ -3316,23 +3414,20 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
- * Because page_memcg(head) is not set on tails, set it now.
+ * Because page_objcg(head) is not set on tails, set it now.
  */
 void split_page_memcg(struct page *head, unsigned int nr)
 {
-	struct mem_cgroup *memcg = page_memcg(head);
+	struct obj_cgroup *objcg = page_objcg(head);
 	int i;
 
-	if (mem_cgroup_disabled() || !memcg)
+	if (mem_cgroup_disabled() || !objcg)
 		return;
 
 	for (i = 1; i < nr; i++)
 		head[i].memcg_data = head->memcg_data;
 
-	if (PageMemcgKmem(head))
-		obj_cgroup_get_many(__page_objcg(head), nr - 1);
-	else
-		css_get_many(&memcg->css, nr - 1);
+	obj_cgroup_get_many(objcg, nr - 1);
 }
 
 #ifdef CONFIG_MEMCG_SWAP
@@ -5303,6 +5398,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	objcg->memcg = memcg;
 	rcu_assign_pointer(memcg->objcg, objcg);
 
+	if (unlikely(mem_cgroup_is_root(memcg)))
+		root_obj_cgroup = objcg;
+
 	/* Online state pins memcg ID, memcg ID pins CSS */
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
@@ -5731,10 +5829,10 @@ static int mem_cgroup_move_account(struct page *page,
 	 */
 	smp_mb();
 
-	css_get(&to->css);
-	css_put(&from->css);
+	obj_cgroup_get(to->objcg);
+	obj_cgroup_put(from->objcg);
 
-	page->memcg_data = (unsigned long)to;
+	page->memcg_data = (unsigned long)to->objcg;
 
 	__unlock_page_memcg(from);
 
@@ -6206,6 +6304,42 @@ static void mem_cgroup_move_charge(void)
 
 	mmap_read_unlock(mc.mm);
 	atomic_dec(&mc.from->moving_account);
+
+	/*
+	 * Moving its pages to another memcg is finished. Wait for already
+	 * started RCU-only updates to finish to make sure that the caller
+	 * of lock_page_memcg() can unlock the correct move_lock. The
+	 * possible bad scenario would like:
+	 *
+	 * CPU0:				CPU1:
+	 * mem_cgroup_move_charge()
+	 *     walk_page_range()
+	 *
+	 *					lock_page_memcg(page)
+	 *					    memcg = page_memcg(page)
+	 *					    spin_lock_irqsave(&memcg->move_lock)
+	 *					    memcg->move_lock_task = current
+	 *
+	 *     atomic_dec(&mc.from->moving_account)
+	 *
+	 * mem_cgroup_css_offline()
+	 *   memcg_offline_kmem()
+	 *     memcg_reparent_objcgs()    <== reparented
+	 *
+	 *					unlock_page_memcg(page)
+	 *					    memcg = page_memcg(page)  <== memcg has been changed
+	 *					    if (memcg->move_lock_task == current)  <== false
+	 *					        spin_unlock_irqrestore(&memcg->move_lock)
+	 *
+	 * Once mem_cgroup_move_charge() returns (it means that the cgroup_mutex
+	 * would be released soon), the page can be reparented to its parent
+	 * memcg. When the unlock_page_memcg() is called for the page, we will
+	 * miss unlock the move_lock. So using synchronize_rcu to wait for
+	 * already started RCU-only updates to finish before this function
+	 * returns (mem_cgroup_move_charge() and mem_cgroup_css_offline() are
+	 * serialized by cgroup_mutex).
+	 */
+	synchronize_rcu();
 }
 
 /*
@@ -6762,21 +6896,26 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 
 static int charge_memcg(struct page *page, struct mem_cgroup *memcg, gfp_t gfp)
 {
+	struct obj_cgroup *objcg;
 	unsigned int nr_pages = thp_nr_pages(page);
-	int ret;
+	int ret = 0;
 
-	ret = try_charge(memcg, gfp, nr_pages);
+	objcg = get_obj_cgroup_from_memcg(memcg);
+	/* Do not account at the root objcg level. */
+	if (!obj_cgroup_is_root(objcg))
+		ret = try_charge(memcg, gfp, nr_pages);
 	if (ret)
 		goto out;
 
-	css_get(&memcg->css);
-	commit_charge(page, memcg);
+	obj_cgroup_get(objcg);
+	commit_charge(page, objcg);
 
 	local_irq_disable();
 	mem_cgroup_charge_statistics(memcg, page, nr_pages);
 	memcg_check_events(memcg, page);
 	local_irq_enable();
 out:
+	obj_cgroup_put(objcg);
 	return ret;
 }
 
@@ -6876,7 +7015,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
 }
 
 struct uncharge_gather {
-	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 	unsigned long nr_memory;
 	unsigned long pgpgout;
 	unsigned long nr_kmem;
@@ -6891,84 +7030,72 @@ static inline void uncharge_gather_clear(struct uncharge_gather *ug)
 static void uncharge_batch(const struct uncharge_gather *ug)
 {
 	unsigned long flags;
+	struct mem_cgroup *memcg;
 
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(ug->objcg);
 	if (ug->nr_memory) {
-		page_counter_uncharge(&ug->memcg->memory, ug->nr_memory);
+		page_counter_uncharge(&memcg->memory, ug->nr_memory);
 		if (do_memsw_account())
-			page_counter_uncharge(&ug->memcg->memsw, ug->nr_memory);
+			page_counter_uncharge(&memcg->memsw, ug->nr_memory);
 		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
-			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
-		memcg_oom_recover(ug->memcg);
+			page_counter_uncharge(&memcg->kmem, ug->nr_kmem);
+		memcg_oom_recover(memcg);
 	}
 
 	local_irq_save(flags);
-	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
-	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
-	memcg_check_events(ug->memcg, ug->dummy_page);
+	__count_memcg_events(memcg, PGPGOUT, ug->pgpgout);
+	__this_cpu_add(memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+	memcg_check_events(memcg, ug->dummy_page);
 	local_irq_restore(flags);
+	rcu_read_unlock();
 
 	/* drop reference from uncharge_page */
-	css_put(&ug->memcg->css);
+	obj_cgroup_put(ug->objcg);
 }
 
 static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 {
 	unsigned long nr_pages;
-	struct mem_cgroup *memcg;
 	struct obj_cgroup *objcg;
-	bool use_objcg = PageMemcgKmem(page);
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
 	/*
 	 * Nobody should be changing or seriously looking at
-	 * page memcg or objcg at this point, we have fully
-	 * exclusive access to the page.
+	 * page objcg at this point, we have fully exclusive
+	 * access to the page.
 	 */
-	if (use_objcg) {
-		objcg = __page_objcg(page);
-		/*
-		 * This get matches the put at the end of the function and
-		 * kmem pages do not hold memcg references anymore.
-		 */
-		memcg = get_mem_cgroup_from_objcg(objcg);
-	} else {
-		memcg = __page_memcg(page);
-	}
-
-	if (!memcg)
+	objcg = page_objcg(page);
+	if (!objcg)
 		return;
 
-	if (ug->memcg != memcg) {
-		if (ug->memcg) {
+	if (ug->objcg != objcg) {
+		if (ug->objcg) {
 			uncharge_batch(ug);
 			uncharge_gather_clear(ug);
 		}
-		ug->memcg = memcg;
+		ug->objcg = objcg;
 		ug->dummy_page = page;
 
-		/* pairs with css_put in uncharge_batch */
-		css_get(&memcg->css);
+		/* pairs with obj_cgroup_put in uncharge_batch */
+		obj_cgroup_get(objcg);
 	}
 
 	nr_pages = compound_nr(page);
 
-	if (use_objcg) {
+	if (PageMemcgKmem(page)) {
 		ug->nr_memory += nr_pages;
 		ug->nr_kmem += nr_pages;
-
-		page->memcg_data = 0;
-		obj_cgroup_put(objcg);
 	} else {
 		/* LRU pages aren't accounted at the root level */
-		if (!mem_cgroup_is_root(memcg))
+		if (!obj_cgroup_is_root(objcg))
 			ug->nr_memory += nr_pages;
 		ug->pgpgout++;
-
-		page->memcg_data = 0;
 	}
 
-	css_put(&memcg->css);
+	page->memcg_data = 0;
+	obj_cgroup_put(objcg);
 }
 
 /**
@@ -6982,7 +7109,7 @@ void __mem_cgroup_uncharge(struct page *page)
 	struct uncharge_gather ug;
 
 	/* Don't touch page->lru of any random page, pre-check: */
-	if (!page_memcg(page))
+	if (!page_objcg(page))
 		return;
 
 	uncharge_gather_clear(&ug);
@@ -7005,7 +7132,7 @@ void __mem_cgroup_uncharge_list(struct list_head *page_list)
 	uncharge_gather_clear(&ug);
 	list_for_each_entry(page, page_list, lru)
 		uncharge_page(page, &ug);
-	if (ug.memcg)
+	if (ug.objcg)
 		uncharge_batch(&ug);
 }
 
@@ -7022,6 +7149,7 @@ void __mem_cgroup_uncharge_list(struct list_head *page_list)
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 {
 	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 	unsigned int nr_pages;
 	unsigned long flags;
 
@@ -7035,32 +7163,34 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 		return;
 
 	/* Page cache replacement: new page already charged? */
-	if (page_memcg(newpage))
+	if (page_objcg(newpage))
 		return;
 
-	memcg = get_mem_cgroup_from_page(oldpage);
-	VM_WARN_ON_ONCE_PAGE(!memcg, oldpage);
-	if (!memcg)
+	objcg = page_objcg(oldpage);
+	VM_WARN_ON_ONCE_PAGE(!objcg, oldpage);
+	if (!objcg)
 		return;
 
 	/* Force-charge the new page. The old one will be freed soon */
 	nr_pages = thp_nr_pages(newpage);
 
-	if (!mem_cgroup_is_root(memcg)) {
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+
+	if (!obj_cgroup_is_root(objcg)) {
 		page_counter_charge(&memcg->memory, nr_pages);
 		if (do_memsw_account())
 			page_counter_charge(&memcg->memsw, nr_pages);
 	}
 
-	css_get(&memcg->css);
-	commit_charge(newpage, memcg);
+	obj_cgroup_get(objcg);
+	commit_charge(newpage, objcg);
 
 	local_irq_save(flags);
 	mem_cgroup_charge_statistics(memcg, newpage, nr_pages);
 	memcg_check_events(memcg, newpage);
 	local_irq_restore(flags);
-
-	css_put(&memcg->css);
+	rcu_read_unlock();
 }
 
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
@@ -7235,6 +7365,7 @@ static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
 void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 {
 	struct mem_cgroup *memcg, *swap_memcg;
+	struct obj_cgroup *objcg;
 	unsigned int nr_entries;
 	unsigned short oldid;
 
@@ -7247,15 +7378,16 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return;
 
+	objcg = page_objcg(page);
+	VM_WARN_ON_ONCE_PAGE(!objcg, page);
+	if (!objcg)
+		return;
+
 	/*
 	 * Interrupts should be disabled by the caller (see the comments below),
 	 * which can serve as RCU read-side critical sections.
 	 */
-	memcg = page_memcg(page);
-
-	VM_WARN_ON_ONCE_PAGE(!memcg, page);
-	if (!memcg)
-		return;
+	memcg = obj_cgroup_memcg(objcg);
 
 	/*
 	 * In case the memcg owning these pages has been offlined and doesn't
@@ -7274,7 +7406,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 
 	page->memcg_data = 0;
 
-	if (!mem_cgroup_is_root(memcg))
+	if (!obj_cgroup_is_root(objcg))
 		page_counter_uncharge(&memcg->memory, nr_entries);
 
 	if (!cgroup_memory_noswap && memcg != swap_memcg) {
@@ -7293,7 +7425,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	mem_cgroup_charge_statistics(memcg, page, -nr_entries);
 	memcg_check_events(memcg, page);
 
-	css_put(&memcg->css);
+	obj_cgroup_put(objcg);
 }
 
 /**
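
The core idea of the patch can be pictured with a small userspace model: page->memcg_data stores an obj_cgroup pointer, page_memcg() resolves the memcg through it, and reparenting only has to retarget objcg->memcg. This is a sketch under simplifying assumptions (no RCU, no locking, no flag bits, no page counters); every demo_* name is hypothetical and uintptr_t stands in for the kernel's unsigned long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of a memory cgroup: only the parent link. */
struct demo_memcg {
	struct demo_memcg *parent;
};

/* Hypothetical miniature of an object cgroup. */
struct demo_objcg {
	struct demo_memcg *memcg;	/* retargeted to the parent on reparenting */
	long refcnt;
};

/* Hypothetical miniature of struct page: only the charge binding. */
struct demo_page {
	uintptr_t memcg_data;		/* a demo_objcg pointer, or 0 if uncharged */
};

/* Analogue of page_objcg(): decode the stored pointer. */
static struct demo_objcg *demo_page_objcg(const struct demo_page *page)
{
	return (struct demo_objcg *)page->memcg_data;
}

/* Analogue of page_memcg(): always resolved through the objcg. */
static struct demo_memcg *demo_page_memcg(const struct demo_page *page)
{
	struct demo_objcg *objcg = demo_page_objcg(page);

	return objcg ? objcg->memcg : NULL;
}

/* Analogue of obj_cgroup_get() + commit_charge(): pin objcg, bind page. */
static void demo_commit_charge(struct demo_page *page, struct demo_objcg *objcg)
{
	assert(page->memcg_data == 0);
	objcg->refcnt++;
	page->memcg_data = (uintptr_t)objcg;
}

/* Analogue of the uncharge path: unbind the page, drop its reference. */
static void demo_uncharge(struct demo_page *page)
{
	struct demo_objcg *objcg = demo_page_objcg(page);

	if (!objcg)
		return;
	page->memcg_data = 0;
	objcg->refcnt--;
}

/* Reparenting touches only the objcg; no page is visited or rewritten. */
static void demo_reparent(struct demo_objcg *objcg)
{
	if (objcg->memcg->parent)
		objcg->memcg = objcg->memcg->parent;
}
```

In this model, a page charged to a child cgroup reports the parent from demo_page_memcg() after demo_reparent(), while demo_page_objcg() and page->memcg_data stay unchanged. The model leaves out the lruvec and split-queue splicing that the diff performs under the reparent locks.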
