
[rdma-next] IB/core: Add option to limit user mad receive list

Message ID: 70029b5f256fbad6efbb98458deb9c46baa2c4b3.1712051390.git.leon@kernel.org
State: Changes Requested

Commit Message

Leon Romanovsky April 2, 2024, 9:50 a.m. UTC
From: Michael Guralnik <michaelgur@nvidia.com>

ib_umad keeps the received MAD packets in a list that is not limited
in size. As the extraction of packets from this list is done by a
user-space application, there is no way to guarantee that the
extraction rate is faster than the rate of incoming packets. This can
cause the list to grow uncontrollably.

As a solution, let's add a new sysfs control knob for users to limit
the number of received MAD packets in the list.

Packets received when the list is full are dropped. Sent packets that
are queued on the receive list for whatever reason, like timed-out
sends, are not dropped even when the list is full.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../ABI/stable/sysfs-class-infiniband         | 12 ++++
 drivers/infiniband/core/user_mad.c            | 63 ++++++++++++++++++-
 2 files changed, 72 insertions(+), 3 deletions(-)
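From user space, the proposed knob would be exercised roughly as follows (a sketch, assuming the patch is applied; `umad0` is one example device, and a temporary file stands in for the real sysfs attribute so the snippet runs anywhere):

```shell
# Sketch of using the proposed knob. On a real host the attribute would
# live at /sys/class/infiniband_mad/umad0/max_recv_list_size; a temp
# file stands in for it here so the commands are runnable anywhere.
attr=$(mktemp)

echo 0 > "$attr"        # default: 0 means the receive list is unlimited
cat "$attr"

echo 10000 > "$attr"    # cap the receive list at 10000 queued packets
cat "$attr"

rm -f "$attr"
```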

Comments

Jason Gunthorpe April 4, 2024, 2:01 p.m. UTC | #1
On Tue, Apr 02, 2024 at 12:50:21PM +0300, Leon Romanovsky wrote:
> [...]
> +What:		/sys/class/infiniband_mad/umad<N>/max_recv_list_size
> +Date:		January, 2024
> +KernelVersion:	v6.9
> +Contact:	linux-rdma@vger.kernel.org
> +Description:
> +		(RW) Limit the size of the list of MAD packets waiting to be
> +		     read by the user-space agent.
> +		     The default value is 0, which means unlimited list size.
> +		     Packets received when the list is full will be silently
> +		     dropped.

I'm really not keen on this as a tunable; when we get to future
designs it may be hard to retain this specific behavior.

Why do we need a tunable? Can we just set it to something large and be
done with it?

Jason
Leon Romanovsky April 4, 2024, 4:51 p.m. UTC | #2
On Thu, Apr 04, 2024 at 11:01:13AM -0300, Jason Gunthorpe wrote:
> [...]
> 
> I'm really not keen on this as a tunable; when we get to future
> designs it may be hard to retain this specific behavior.
> 
> Why do we need a tunable? Can we just set it to something large and be
> done with it?

I don't know which value would be large enough on one side, yet small
enough not to cause an OOM while the host receives MAD packets.

Thanks

> 
> Jason
>
Jason Gunthorpe April 5, 2024, 12:58 p.m. UTC | #3
On Thu, Apr 04, 2024 at 07:51:03PM +0300, Leon Romanovsky wrote:
> [...]
> > 
> > I'm really not keen on this as a tunable; when we get to future
> > designs it may be hard to retain this specific behavior.
> > 
> > Why do we need a tunable? Can we just set it to something large and be
> > done with it?
> 
> I don't know which value would be large enough on one side, yet small
> enough not to cause an OOM while the host receives MAD packets.

I'd say something like 32M worth of packets is probably good from both
directions? Or 1% of system memory, or something like that.

Jason

Patch

diff --git a/Documentation/ABI/stable/sysfs-class-infiniband b/Documentation/ABI/stable/sysfs-class-infiniband
index 694f23a03a28..0ea9d590ab0e 100644
--- a/Documentation/ABI/stable/sysfs-class-infiniband
+++ b/Documentation/ABI/stable/sysfs-class-infiniband
@@ -275,6 +275,18 @@  Description:
 		=============== ===========================================
 
 
+What:		/sys/class/infiniband_mad/umad<N>/max_recv_list_size
+Date:		January, 2024
+KernelVersion:	v6.9
+Contact:	linux-rdma@vger.kernel.org
+Description:
+		(RW) Limit the size of the list of MAD packets waiting to be
+		     read by the user-space agent.
+		     The default value is 0, which means unlimited list size.
+		     Packets received when the list is full will be silently
+		     dropped.
+
+
 What:		/sys/class/infiniband_verbs/abi_version
 Date:		Sep, 2005
 KernelVersion:	v2.6.14
diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index f5feca7fa9b9..96fe54cd4c8a 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -102,6 +102,7 @@  struct ib_umad_port {
 	struct ib_umad_device *umad_dev;
 	int                    dev_num;
 	u32                     port_num;
+	int			max_recv_list_size;
 };
 
 struct ib_umad_device {
@@ -113,6 +114,7 @@  struct ib_umad_file {
 	struct mutex		mutex;
 	struct ib_umad_port    *port;
 	struct list_head	recv_list;
+	atomic_t		recv_list_size;
 	struct list_head	send_list;
 	struct list_head	port_list;
 	spinlock_t		send_lock;
@@ -180,9 +182,27 @@  static struct ib_mad_agent *__get_agent(struct ib_umad_file *file, int id)
 	return file->agents_dead ? NULL : file->agent[id];
 }
 
+static inline bool should_drop_packet(struct ib_umad_file *file,
+				      struct ib_umad_packet *packet,
+				      bool is_send_mad)
+{
+	if (is_send_mad)
+		return false;
+
+	if (!file->port->max_recv_list_size)
+		return false;
+
+	if (atomic_read(&file->recv_list_size) <
+	    file->port->max_recv_list_size)
+		return false;
+
+	return true;
+}
+
 static int queue_packet(struct ib_umad_file *file,
 			struct ib_mad_agent *agent,
-			struct ib_umad_packet *packet)
+			struct ib_umad_packet *packet,
+			bool is_send_mad)
 {
 	int ret = 1;
 
@@ -192,7 +212,10 @@  static int queue_packet(struct ib_umad_file *file,
 	     packet->mad.hdr.id < IB_UMAD_MAX_AGENTS;
 	     packet->mad.hdr.id++)
 		if (agent == __get_agent(file, packet->mad.hdr.id)) {
+			if (should_drop_packet(file, packet, is_send_mad))
+				break;
 			list_add_tail(&packet->list, &file->recv_list);
+			atomic_inc(&file->recv_list_size);
 			wake_up_interruptible(&file->recv_wait);
 			ret = 0;
 			break;
@@ -224,7 +247,7 @@  static void send_handler(struct ib_mad_agent *agent,
 	if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) {
 		packet->length = IB_MGMT_MAD_HDR;
 		packet->mad.hdr.status = ETIMEDOUT;
-		if (!queue_packet(file, agent, packet))
+		if (!queue_packet(file, agent, packet, true))
 			return;
 	}
 	kfree(packet);
@@ -284,7 +307,7 @@  static void recv_handler(struct ib_mad_agent *agent,
 		rdma_destroy_ah_attr(&ah_attr);
 	}
 
-	if (queue_packet(file, agent, packet))
+	if (queue_packet(file, agent, packet, false))
 		goto err2;
 	return;
 
@@ -409,6 +432,7 @@  static ssize_t ib_umad_read(struct file *filp, char __user *buf,
 
 	packet = list_entry(file->recv_list.next, struct ib_umad_packet, list);
 	list_del(&packet->list);
+	atomic_dec(&file->recv_list_size);
 
 	mutex_unlock(&file->mutex);
 
@@ -421,6 +445,7 @@  static ssize_t ib_umad_read(struct file *filp, char __user *buf,
 		/* Requeue packet */
 		mutex_lock(&file->mutex);
 		list_add(&packet->list, &file->recv_list);
+		atomic_inc(&file->recv_list_size);
 		mutex_unlock(&file->mutex);
 	} else {
 		if (packet->recv_wc)
@@ -1222,9 +1247,42 @@  static ssize_t port_show(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RO(port);
 
+static ssize_t max_recv_list_size_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct ib_umad_port *port = dev_get_drvdata(dev);
+	unsigned int val;
+	int ret;
+
+	if (!port)
+		return -ENODEV;
+
+	ret = kstrtouint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	port->max_recv_list_size = val;
+
+	return count;
+}
+
+static ssize_t max_recv_list_size_show(struct device *dev,
+				       struct device_attribute *attr, char *buf)
+{
+	struct ib_umad_port *port = dev_get_drvdata(dev);
+
+	if (!port)
+		return -ENODEV;
+
+	return sysfs_emit(buf, "%d\n", port->max_recv_list_size);
+}
+static DEVICE_ATTR_RW(max_recv_list_size);
+
 static struct attribute *umad_class_dev_attrs[] = {
 	&dev_attr_ibdev.attr,
 	&dev_attr_port.attr,
+	&dev_attr_max_recv_list_size.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(umad_class_dev);
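The queueing policy the patch implements can be condensed into a plain C sketch (a userspace stand-in, not the kernel code): received packets are dropped once the list is full, while completions of locally sent MADs always get queued.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace condensation of the patch's policy, mirroring
 * should_drop_packet()/queue_packet(): received packets are dropped
 * once the list holds max_size entries (0 means unlimited), while
 * send-MAD completions (e.g. timed-out sends) bypass the cap. */
struct recv_list {
	int size;      /* current number of queued packets */
	int max_size;  /* 0 = unlimited */
};

static bool try_queue(struct recv_list *l, bool is_send_mad)
{
	if (!is_send_mad && l->max_size && l->size >= l->max_size)
		return false;   /* receive list full: drop the packet */

	l->size++;              /* queued */
	return true;
}
```

With `max_size = 2`, the third received packet is dropped but a timed-out send is still queued; with `max_size = 0`, nothing is ever dropped.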