
[RFC,v2,0/7] On-Demand Paging on SoftRoCE

Message ID cover.1668157436.git.matsuda-daisuke@fujitsu.com (mailing list archive)

Message

Daisuke Matsuda (Fujitsu) Nov. 11, 2022, 9:22 a.m. UTC
This patch series implements the On-Demand Paging (ODP) feature in the
SoftRoCE (rxe) driver, which so far has been available only in the mlx5
driver[1].

[Overview]
When applications register a memory region (MR), RDMA drivers normally pin
its pages so that physical addresses never change during RDMA
communication. This requires the MR to fit in physical memory and
inevitably leads to memory pressure. On-Demand Paging (ODP), by contrast,
allows applications to register MRs without pinning pages. Pages are paged
in when the driver needs them and paged out when the OS reclaims them. As a
result, it is possible to register a large MR that does not fit in physical
memory without consuming much of it.
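
From an application's point of view, enabling ODP is just a matter of
passing an extra access flag when registering the region. A minimal
userspace sketch with libibverbs (the buffer and length variables are
illustrative):

/* #include <infiniband/verbs.h>; pd is an allocated struct ibv_pd * */
struct ibv_mr *mr = ibv_reg_mr(pd, buf, length,
			       IBV_ACCESS_LOCAL_WRITE |
			       IBV_ACCESS_REMOTE_READ |
			       IBV_ACCESS_REMOTE_WRITE |
			       IBV_ACCESS_ON_DEMAND);
if (!mr)
	/* e.g. EOPNOTSUPP if the device lacks the requested ODP support */
	perror("ibv_reg_mr");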

[Why add this feature?]
We, Fujitsu, have been contributing to RDMA with a view to using it with
persistent memory. Persistent memory can host a filesystem that allows
applications to read and write files directly, without involving the page
cache. This is called FS-DAX (filesystem direct access) mode. The problem
is that data on a DAX-enabled filesystem cannot be replicated with software
RAID or other hardware methods. Data replication over RDMA, with its
high-speed connections, is the best solution to this problem.

However, there is a known issue that hinders using RDMA with FS-DAX. When
RDMA operations on a file and updates of the file's metadata are processed
concurrently on the same node, invalid memory accesses can occur because
the updated metadata is disregarded. This happens because RDMA operations
do not go through the page cache but access data directly. There was an
effort[2] to solve this problem, but it was rejected in the end. Though no
general solution is available, the problem can be worked around with the
ODP feature, which lets the kernel driver update the metadata before
processing RDMA operations.

We have been enhancing rxe to expedite the use of persistent memory. Our
contributions to rxe include RDMA Atomic write[3] and RDMA Flush[4]. Once
they are merged along with ODP, an environment will be ready for developers
to create and test software for RDMA with FS-DAX. A library (librpma)[5] is
being developed for this purpose. This environment can be used by anybody
with an ordinary computer and a normal NIC, without any special hardware,
though it is inferior to hardware implementations in terms of performance.

[Design considerations]
ODP has been available only in mlx5, but functions and data structures that
can be used commonly are provided in ib_uverbs (infiniband/core). The
interface is heavily dependent on the HMM infrastructure[6], and this
patchset uses it as much as possible. While mlx5 has both Explicit and
Implicit ODP features along with a prefetch feature, this patchset
implements the Explicit ODP feature only.
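
For reference, the main common interfaces this series builds on are
declared in include/rdma/ib_umem_odp.h and implemented in
infiniband/core/umem_odp.c. Roughly (see the tree for the exact
prototypes):

struct ib_umem_odp *ib_umem_odp_get(struct ib_device *device,
				    unsigned long addr, size_t size,
				    int access,
				    const struct mmu_interval_notifier_ops *ops);
int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
				 u64 bcnt, u64 access_mask, bool fault);
void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
				 u64 bound);
void ib_umem_odp_release(struct ib_umem_odp *umem_odp);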

One important change is that the three tasklets (requester, responder and
completer) must be converted to workqueues, because they need to be able to
sleep in order to trigger page faults before accessing MRs. There have been
some discussions, and Bob Pearson thankfully posted patches[7] to do this
conversion. A large part of my 2nd patch will be dropped because Bob's
workqueue implementation is likely to be adopted. However, I will still
have to modify rxe_comp_queue_pkt() and rxe_resp_queue_pkt() so that work
items that access user MRs are scheduled on a workqueue rather than run
inline; a rough illustration follows this paragraph.
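
The following is only a sketch of that idea, not the actual patch;
rxe_qp_may_fault() is a hypothetical helper, and field names will depend on
Bob's conversion:

/* Packets arrive in a context that cannot sleep, so force deferral to the
 * responder work item whenever handling the packet may fault in pages of
 * an ODP MR.
 */
static inline void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
{
	int must_sched;
	struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);

	skb_queue_tail(&qp->req_pkts, skb);

	must_sched = rxe_qp_may_fault(qp) ||
		     (pkt->opcode == IB_OPCODE_RC_RDMA_READ_REQUEST) ||
		     (skb_queue_len(&qp->req_pkts) > 1);

	/* sched != 0 queues the work item; sched == 0 runs it inline */
	rxe_run_work(&qp->resp.work, must_sched);
}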

If the responder and completer sleep, packet drops due to overflow of the
receive queues become more likely. There are multiple queues involved, but,
since SoftRoCE uses UDP, the most important ones are the UDP socket
buffers. Their size can be configured via the net.core.rmem_default and
net.core.rmem_max sysctl parameters. Users should raise these values if
packet drops occur, but a page fault is typically not long enough to cause
the problem.
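
For example (the values below are illustrative only):

# raise the default and maximum socket receive buffer sizes (bytes)
sysctl -w net.core.rmem_default=8388608
sysctl -w net.core.rmem_max=8388608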

[How does ODP work?]
"struct ib_umem_odp" is used to manage pages. It is created for each
ODP-enabled MR at registration time. This struct holds a pair of arrays
(dma_list/pfn_list) that serve as a driver page table, storing the DMA
addresses and PFNs of the pages in the MR. They are updated on page-in and
page-out, both of which use the common interfaces in ib_uverbs.
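
For reference, the relevant part of the struct, abridged from
include/rdma/ib_umem_odp.h (see the tree for the full definition):

struct ib_umem_odp {
	struct ib_umem umem;
	struct mmu_interval_notifier notifier;
	struct pid *tgid;

	/* PFNs of the pages currently backing the on-demand paging umem */
	unsigned long *pfn_list;

	/* DMA addresses mapped for the pfns in pfn_list; the low bits
	 * encode the allowed access (read/write)
	 */
	dma_addr_t *dma_list;

	/* protects pfn_list/dma_list: only one thread may map or unmap
	 * pages at a time
	 */
	struct mutex umem_mutex;
	...
};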

Page-in can occur when the requester, responder or completer accesses an MR
to process RDMA operations. If they find that the pages being accessed are
not present in physical memory, or that the requisite permissions are not
set on them, they trigger a page fault to make the pages present with the
proper permissions and, at the same time, update the driver page table.
After confirming that the pages are present, they execute the memory access
(read, write or atomic operations).
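
As an illustration, a page-in path built on the common interface could look
roughly like this (the helper name and its surroundings are placeholders,
not necessarily what the patches use):

/* Fault in the pages backing [iova, iova + length) and refresh the driver
 * page table before accessing the MR.
 */
static int rxe_odp_fault_pages(struct rxe_mr *mr, u64 iova, int length,
			       bool for_write)
{
	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
	u64 access_mask = ODP_READ_ALLOWED_BIT;
	int np;

	if (for_write)
		access_mask |= ODP_WRITE_ALLOWED_BIT;

	/* Faults pages in via hmm_range_fault() and fills pfn_list/dma_list.
	 * On success, umem_mutex is left held so the caller can access the
	 * MR before a concurrent invalidation can run.
	 */
	np = ib_umem_odp_map_dma_and_lock(umem_odp, iova, length,
					  access_mask, true);
	if (np < 0)
		return np;

	/* ... read/write/atomic access to the MR goes here ... */

	mutex_unlock(&umem_odp->umem_mutex);
	return 0;
}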

Page-out is triggered by page reclaim or by filesystem events (e.g. a
metadata update of a file that is being used as an MR). When creating an
ODP-enabled MR, the driver registers an MMU notifier callback. When the
kernel issues a page invalidation notification, the callback is invoked to
unmap the DMA addresses and update the driver page table. After that, the
kernel releases the pages.
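
A sketch of such a callback, modeled on the mlx5 implementation (the rxe
callback in this series may differ in detail):

static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
				    const struct mmu_notifier_range *range,
				    unsigned long cur_seq)
{
	struct ib_umem_odp *umem_odp =
		container_of(mni, struct ib_umem_odp, notifier);
	u64 start, end;

	if (!mmu_notifier_range_blockable(range))
		return false;

	mutex_lock(&umem_odp->umem_mutex);
	mmu_interval_set_seq(mni, cur_seq);

	start = max_t(u64, ib_umem_start(umem_odp), range->start);
	end = min_t(u64, ib_umem_end(umem_odp), range->end);

	/* unmap DMA addresses and clear the driver page table entries */
	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);

	mutex_unlock(&umem_odp->umem_mutex);
	return true;
}

static const struct mmu_interval_notifier_ops rxe_mn_ops = {
	.invalidate = rxe_ib_invalidate_range,
};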

[Supported operations]
All operations are supported on RC connection. Atomic write[3] and Flush[4]
operations, which are still under review, are also going to be supported
after these patches are merged. On UD connection, Send, Recv, SRQ-Recv are
supported.
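
In terms of the kernel capability bits, this corresponds roughly to the
driver advertising something like the following (a sketch only; the exact
bits set by this series may differ, and there are no Flush/Atomic-write ODP
bits yet):

/* attr is the rxe device's struct ib_device_attr */
attr->odp_caps.general_caps = IB_ODP_SUPPORT;
attr->odp_caps.per_transport_caps.rc_odp_caps =
	IB_ODP_SUPPORT_SEND | IB_ODP_SUPPORT_RECV | IB_ODP_SUPPORT_SRQ_RECV |
	IB_ODP_SUPPORT_WRITE | IB_ODP_SUPPORT_READ | IB_ODP_SUPPORT_ATOMIC;
attr->odp_caps.per_transport_caps.ud_odp_caps =
	IB_ODP_SUPPORT_SEND | IB_ODP_SUPPORT_RECV | IB_ODP_SUPPORT_SRQ_RECV;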

[How to test ODP?]
There are only a few resources available for testing. pyverbs testcases in
rdma-core and perftest[8] are recommendable ones. I posted a patchset[9] to
expand pyverbs testcases, but they are not merged as of now. Other than
them, the ibv_rc_pingpong command can also used for testing. Note that you
may have to build perftest from upstream since older versions do not handle
ODP capabilities correctly.
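
Before running the tests, it can help to confirm that the kernel actually
exposes the expected ODP capabilities. A minimal userspace check with
libibverbs (ctx is an opened struct ibv_context *):

/* #include <infiniband/verbs.h>, #include <stdio.h> */
struct ibv_device_attr_ex attr;

if (ibv_query_device_ex(ctx, NULL, &attr))
	return;		/* query failed */

if (attr.odp_caps.general_caps & IBV_ODP_SUPPORT)
	printf("ODP is supported\n");
if (attr.odp_caps.per_transport_caps.rc_odp_caps & IBV_ODP_SUPPORT_READ)
	printf("RC RDMA Read with ODP is supported\n");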

The tree is available from the URL below:
https://github.com/daimatsuda/linux/tree/odp_rfc_v2

[Future work]
My next work will be the prefetch feature. It allows applications to
trigger page fault using ibv_advise_mr(3) to optimize performance. Some
existing software like librpma use this feature. Additionally, I think we
can also add the implicit ODP feature in the future.
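
For reference, a prefetch with the existing verbs API looks roughly like
the following (addresses and lengths are illustrative; rxe does not support
this yet):

/* Pre-fault part of an ODP MR so that later RDMA operations do not take a
 * page fault on the data path.
 */
struct ibv_sge sg = {
	.addr   = (uintptr_t)buf,
	.length = length,
	.lkey   = mr->lkey,
};

int ret = ibv_advise_mr(pd, IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE,
			IBV_ADVISE_MR_FLAG_FLUSH, &sg, 1);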

[1] [RFC 00/20] On demand paging
https://www.spinics.net/lists/linux-rdma/msg18906.html

[2] [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
https://lore.kernel.org/nvdimm/20190809225833.6657-1-ira.weiny@intel.com/

[3] [PATCH v6 0/8] RDMA/rxe: Add atomic write operation
https://lore.kernel.org/all/20221015063648.52285-1-yangx.jy@fujitsu.com/

[4] [for-next PATCH v5 00/11] RDMA/rxe: Add RDMA FLUSH operation
https://lore.kernel.org/lkml/20220927055337.22630-12-lizhijian@fujitsu.com/t/

[5] librpma: Remote Persistent Memory Access Library
https://github.com/pmem/rpma

[6] Heterogeneous Memory Management (HMM)
https://www.kernel.org/doc/html/latest/mm/hmm.html

[7] [PATCH for-next v3 00/13] Implement work queues for rdma_rxe
https://lore.kernel.org/linux-rdma/20221029031009.64467-1-rpearsonhpe@gmail.com/

[8] linux-rdma/perftest: Infiniband Verbs Performance Tests
https://github.com/linux-rdma/perftest

[9] tests: ODP testcases for RDMA Write/Read and Atomic operations #1229
https://github.com/linux-rdma/rdma-core/pull/1229

v1->v2:
 1) Fixed a crash issue reported by Haris Iqbal.
 2) Tried to make lock patterns clearer, as pointed out by Romanovsky.
 3) Minor cleanups and fixes.

Daisuke Matsuda (7):
  IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex
  RDMA/rxe: Convert the triple tasklets to workqueues
  RDMA/rxe: Cleanup code for responder Atomic operations
  RDMA/rxe: Add page invalidation support
  RDMA/rxe: Allow registering MRs for On-Demand Paging
  RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP
  RDMA/rxe: Add support for the traditional Atomic operations with ODP

 drivers/infiniband/core/umem_odp.c    |   8 +-
 drivers/infiniband/hw/mlx5/odp.c      |   4 +-
 drivers/infiniband/sw/rxe/Makefile    |   5 +-
 drivers/infiniband/sw/rxe/rxe.c       |  18 ++
 drivers/infiniband/sw/rxe/rxe_comp.c  |  42 +++-
 drivers/infiniband/sw/rxe/rxe_loc.h   |  13 +-
 drivers/infiniband/sw/rxe/rxe_mr.c    |   7 +-
 drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
 drivers/infiniband/sw/rxe/rxe_odp.c   | 336 ++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c    |  71 +++---
 drivers/infiniband/sw/rxe/rxe_recv.c  |   4 +-
 drivers/infiniband/sw/rxe/rxe_req.c   |  14 +-
 drivers/infiniband/sw/rxe/rxe_resp.c  | 185 +++++++-------
 drivers/infiniband/sw/rxe/rxe_resp.h  |  44 ++++
 drivers/infiniband/sw/rxe/rxe_verbs.c |  16 +-
 drivers/infiniband/sw/rxe/rxe_verbs.h |  10 +-
 drivers/infiniband/sw/rxe/rxe_wq.c    | 160 ++++++++++++
 drivers/infiniband/sw/rxe/rxe_wq.h    |  70 ++++++
 19 files changed, 843 insertions(+), 170 deletions(-)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_resp.h
 create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h

base-commit: 4508d32ccced24c972bc4592104513e1ff8439b5

Comments

Leon Romanovsky Nov. 16, 2022, 6:05 p.m. UTC | #1
On Fri, Nov 11, 2022 at 06:22:21PM +0900, Daisuke Matsuda wrote:
> This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
> driver, which has been available only in mlx5 driver[1] so far.

<...>

> Daisuke Matsuda (7):
>   IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex
>   RDMA/rxe: Convert the triple tasklets to workqueues
>   RDMA/rxe: Cleanup code for responder Atomic operations
>   RDMA/rxe: Add page invalidation support
>   RDMA/rxe: Allow registering MRs for On-Demand Paging
>   RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP
>   RDMA/rxe: Add support for the traditional Atomic operations with ODP

It is a shame that such a cool feature is not progressing.
RXE folks, can you please review it?

Thanks
Daisuke Matsuda (Fujitsu) Nov. 18, 2022, 10:03 a.m. UTC | #2
On Fri, Nov 18, 2022 5:34 PM Hillf Danton wrote:
Hi Hillf,

Thank you for taking a look.

As I wrote in the cover letter, a large part of this patch shall be temporary,
and Bob Pearson's workqueue implementation is likely to be adopted instead
unless there are any problems with it.
[PATCH for-next v3 00/13] Implement work queues for rdma_rxe
Cf. https://lore.kernel.org/linux-rdma/20221029031009.64467-1-rpearsonhpe@gmail.com/

I appreciate your insightful comments. If his workqueue is rejected in the end,
then I will fix them for submission. Otherwise, I am going to rebase my work
onto his patches in the next version.

Thanks,
Daisuke

> On 11 Nov 2022 18:22:23 +0900 Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> > +/*
> > + * this locking is due to a potential race where
> > + * a second caller finds the work already running
> > + * but looks just after the last call to func
> > + */
> > +void rxe_do_work(struct work_struct *w)
> > +{
> > +	int cont;
> > +	int ret;
> > +
> > +	struct rxe_work *work = container_of(w, typeof(*work), work);
> > +	unsigned int iterations = RXE_MAX_ITERATIONS;
> > +
> > +	spin_lock_bh(&work->state_lock);
> > +	switch (work->state) {
> > +	case WQ_STATE_START:
> > +		work->state = WQ_STATE_BUSY;
> > +		spin_unlock_bh(&work->state_lock);
> > +		break;
> > +
> > +	case WQ_STATE_BUSY:
> > +		work->state = WQ_STATE_ARMED;
> > +		fallthrough;
> > +	case WQ_STATE_ARMED:
> > +		spin_unlock_bh(&work->state_lock);
> > +		return;
> > +
> > +	default:
> > +		spin_unlock_bh(&work->state_lock);
> > +		pr_warn("%s failed with bad state %d\n", __func__, work->state);
> > +		return;
> > +	}
> > +
> > +	do {
> > +		cont = 0;
> > +		ret = work->func(work->arg);
> > +
> > +		spin_lock_bh(&work->state_lock);
> > +		switch (work->state) {
> > +		case WQ_STATE_BUSY:
> > +			if (ret) {
> > +				work->state = WQ_STATE_START;
> > +			} else if (iterations--) {
> > +				cont = 1;
> > +			} else {
> > +				/* reschedule the work and exit
> > +				 * the loop to give up the cpu
> > +				 */
> 
> Unlike tasklet, workqueue work is unable to be a CPU hog with PREEMPT
> enabled, otherwise cond_resched() is enough.
> 
> > +				queue_work(work->worker, &work->work);
> 
> Nit, s/worker/workq/ for example as worker, work and workqueue are
> different things in the domain of WQ.
> 
> > +				work->state = WQ_STATE_START;
> > +			}
> > +			break;
> > +
> > +		/* someone tried to run the work since the last time we called
> > +		 * func, so we will call one more time regardless of the
> > +		 * return value
> > +		 */
> > +		case WQ_STATE_ARMED:
> > +			work->state = WQ_STATE_BUSY;
> > +			cont = 1;
> > +			break;
> > +
> > +		default:
> > +			pr_warn("%s failed with bad state %d\n", __func__,
> > +				work->state);
> > +		}
> > +		spin_unlock_bh(&work->state_lock);
> > +	} while (cont);
> > +
> > +	work->ret = ret;
> > +}
> > +
> [...]
> > +void rxe_run_work(struct rxe_work *work, int sched)
> > +{
> > +	if (work->destroyed)
> > +		return;
> > +
> > +	/* busy-loop while qp reset is in progress */
> > +	while (atomic_read(&work->suspended))
> > +		continue;
> 
> Feel free to add a one-line comment specifying the reasons for busy loop
> instead of taking a nap, given it may take two seconds to flush WQ.
> 
> > +
> > +	if (sched)
> > +		queue_work(work->worker, &work->work);
> > +	else
> > +		rxe_do_work(&work->work);
> > +}
> > +
> > +void rxe_disable_work(struct rxe_work *work)
> > +{
> > +	atomic_inc(&work->suspended);
> > +	flush_workqueue(work->worker);
> > +}