Message ID | 20230504082510.247-1-sehuww@mail.scut.edu.cn (mailing list archive)
---|---
State | New, archived
Series | ceph: fix excessive page cache usage
Hi Weiwen,

As discussed in another thread, I have folded your change into my fix in v3.

Thanks

- Xiubo

On 5/4/23 16:25, Hu Weiwen wrote:
> Currently, `ceph_netfs_expand_readahead()` tries to align the read
> request with stripe_unit, which by default is set to 4MB. This means
> that small files will require at least 4MB of page cache, leading to
> inefficient usage of the page cache.
>
> Bound `rreq->len` to the actual file size to restore the previous page
> cache usage.
>
> Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
> Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn>
> ---
>
> We recently updated our kernel, and we are investigating a performance
> regression in our machine learning jobs. For example, one of our jobs
> repeatedly reads a dataset of 62GB spread over 100k files. I expect all
> these IO requests to hit the page cache, since we have more than 100GB
> of memory for cache. However, a lot of network IO is observed, and our
> HDD ceph cluster is fully loaded, resulting in very bad performance.
>
> The regression is bisected to commit
> 49870056005c ("ceph: convert ceph_readpages to ceph_readahead"),
> which was merged in kernel 5.13. After this commit, we need 400GB
> of memory to fully cache these 100k files, which is unacceptable.
>
> The post-EOF page cache is populated at:
> (gathered by `perf record -a -e filemap:mm_filemap_add_to_page_cache -g sleep 2`)
>
> python 3619706 [005] 3103609.736344: filemap:mm_filemap_add_to_page_cache: dev 0:62 ino 1002245af9b page=0x7daf4c pfn=0x7daf4c ofs=1048576
>         ffffffff9aca933a __add_to_page_cache_locked+0x2aa ([kernel.kallsyms])
>         ffffffff9aca933a __add_to_page_cache_locked+0x2aa ([kernel.kallsyms])
>         ffffffff9aca945d add_to_page_cache_lru+0x4d ([kernel.kallsyms])
>         ffffffff9acb66d8 readahead_expand+0x128 ([kernel.kallsyms])
>         ffffffffc0e68fbc netfs_rreq_expand+0x8c ([kernel.kallsyms])
>         ffffffffc0e6a6c2 netfs_readahead+0xf2 ([kernel.kallsyms])
>         ffffffffc104817c ceph_readahead+0xbc ([kernel.kallsyms])
>         ffffffff9acb63c5 read_pages+0x95 ([kernel.kallsyms])
>         ffffffff9acb6921 page_cache_ra_unbounded+0x161 ([kernel.kallsyms])
>         ffffffff9acb6a1d do_page_cache_ra+0x3d ([kernel.kallsyms])
>         ffffffff9acb6b67 ondemand_readahead+0x137 ([kernel.kallsyms])
>         ffffffff9acb700f page_cache_sync_ra+0xcf ([kernel.kallsyms])
>         ffffffff9acab80c filemap_get_pages+0xdc ([kernel.kallsyms])
>         ffffffff9acabe4e filemap_read+0xbe ([kernel.kallsyms])
>         ffffffff9acac285 generic_file_read_iter+0xe5 ([kernel.kallsyms])
>         ffffffffc1041b82 ceph_read_iter+0x182 ([kernel.kallsyms])
>         ffffffff9ad82bf0 new_sync_read+0x110 ([kernel.kallsyms])
>         ffffffff9ad83432 vfs_read+0x102 ([kernel.kallsyms])
>         ffffffff9ad858d7 ksys_read+0x67 ([kernel.kallsyms])
>         ffffffff9ad8597a __x64_sys_read+0x1a ([kernel.kallsyms])
>         ffffffff9b76563c do_syscall_64+0x5c ([kernel.kallsyms])
>         ffffffff9b800099 entry_SYSCALL_64_after_hwframe+0x61 ([kernel.kallsyms])
>             7fad6ca683cc __libc_read+0x4c (/lib/x86_64-linux-gnu/libpthread-2.31.so)
>
> The readahead is expanded too much.
>
> fs/ceph/addr.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 6bb251a4d613..d508901d3739 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -197,6 +197,8 @@ static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
>
>  	/* Now, round up the length to the next block */
>  	rreq->len = roundup(rreq->len, lo->stripe_unit);
> +	/* But do not exceed the file size */
> +	rreq->len = min(rreq->len, (size_t)(rreq->i_size - rreq->start));
>  }
>
>  static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)