diff mbox series

[RFC] net: smc: fix fasync leak in smc_release()

Message ID 20240221051608.43241-1-dmantipov@yandex.ru (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series [RFC] net: smc: fix fasync leak in smc_release() | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 940 this patch: 945
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 6 maintainers not CCed: pabeni@redhat.com guwen@linux.alibaba.com kuba@kernel.org edumazet@google.com alibuda@linux.alibaba.com tonylu@linux.alibaba.com
netdev/build_clang fail Errors and warnings before: 957 this patch: 962
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 957 this patch: 962
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 15 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Dmitry Antipov Feb. 21, 2024, 5:16 a.m. UTC
I've tracked https://syzkaller.appspot.com/bug?extid=5f1acda7e06a2298fae6
down to the problem which may be illustrated by the following pseudocode:

int sock;

/* thread 1 */

while (1) {
       struct msghdr msg = { ... };
       sock = socket(AF_SMC, SOCK_STREAM, 0);
       sendmsg(sock, &msg, MSG_FASTOPEN);
       close(sock);
}

/* thread 2 */

while (1) {
       int on = 1;
       ioctl(sock, FIOASYNC, &on);
       on = 0;
       ioctl(sock, FIOASYNC, &on);
}

That is, something in thread 1 may cause 'smc_switch_to_fallback()' and
swap kernel sockets (of 'struct smc_sock') behind 'sock' between 'ioctl()'
calls in thread 2, so this becomes an attempt to add fasync entry to one
socket but remove from another one. When 'sock' is closing, '__fput()'
calls 'f_op->fasync()' _before_ 'f_op->release()', and it's too late to
revert the trick performed by 'smc_switch_to_fallback()' in 'smc_release()'
and below. Finally we end up with leaked 'struct fasync_struct' object
linked to the base socket, and this object is noticed by '__sock_release()'
("fasync list not empty"). Of course using 'fasync_remove_entry()' in such
a way is extremely ugly, but what else we can do without touching generic
socket code, '__fput()', etc.? Comments are highly appreciated.

Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
---
 net/smc/af_smc.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Wen Gu Feb. 21, 2024, 1:09 p.m. UTC | #1
On 2024/2/21 13:16, Dmitry Antipov wrote:
> I've tracked https://syzkaller.appspot.com/bug?extid=5f1acda7e06a2298fae6
> down to the problem which may be illustrated by the following pseudocode:
> 
> int sock;
> 
> /* thread 1 */
> 
> while (1) {
>         struct msghdr msg = { ... };
>         sock = socket(AF_SMC, SOCK_STREAM, 0);
>         sendmsg(sock, &msg, MSG_FASTOPEN);
>         close(sock);
> }
> 
> /* thread 2 */
> 
> while (1) {
>         int on = 1;
>         ioctl(sock, FIOASYNC, &on);
>         on = 0;
>         ioctl(sock, FIOASYNC, &on);
> }
> 
> That is, something in thread 1 may cause 'smc_switch_to_fallback()' and
> swap kernel sockets (of 'struct smc_sock') behind 'sock' between 'ioctl()'
> calls in thread 2, so this becomes an attempt to add fasync entry to one
> socket but remove from another one. When 'sock' is closing, '__fput()'
> calls 'f_op->fasync()' _before_ 'f_op->release()', and it's too late to
> revert the trick performed by 'smc_switch_to_fallback()' in 'smc_release()'
> and below. Finally we end up with leaked 'struct fasync_struct' object
> linked to the base socket, and this object is noticed by '__sock_release()'
> ("fasync list not empty"). Of course using 'fasync_remove_entry()' in such
> a way is extremely ugly, but what else we can do without touching generic
> socket code, '__fput()', etc.? Comments are highly appreciated.
> 

Hi, Dmitry. Just to confirm if I understand correctly:

1. on = 1; ioctl(sock, FIOASYNC, &on), a fasync entry is added to
    smc->sk.sk_socket->wq.fasync_list;

2. Then fallback happend, and swapped the socket:
    smc->clcsock->file = smc->sk.sk_socket->file;
    smc->clcsock->file->private_data = smc->clcsock;
    smc->clcsock->wq.fasync_list = smc->sk.sk_socket->wq.fasync_list;
    smc->sk.sk_socket->wq.fasync_list = NULL;

3. on = 0; ioctl(sock, FIOASYNC, &on), the fasync entry is removed
    from smc->clcsock->wq.fasync_list,
(Is there a race between 2 and 3 ?)

4. Then close the file, __fput() calls file->f_op->fasync(-1, file, 0),
    then sock_fasync() calls fasync_helper(fd, filp, on, &wq->fasync_list)
    and fasync_remove_entry() removes entries in smc->clcsock->wq.fasync_list.
    Now smc->clcsock->wq.fasync_list is empty.

5. __fput() calls file->f_op->release(inode, file), then sock_close calls
    __sock_release, then ops->release calls smc_release(), and __smc_release()
    calls smc_restore_fallback_changes() to restore socket:
    if (smc->clcsock->file) { /* non-accepted sockets have no file yet */
         smc->clcsock->file->private_data = smc->sk.sk_socket;
         smc->clcsock->file = NULL;
         smc_fback_restore_callbacks(smc);
    }

6. Then back to __sock_release, check if sock->wq.fasync_list (that is
    smc->sk.sk_socket->wq.fasync_list) is empty and it is empty.

So in which step we leaked the fasync_struct entry in smc->sk.sk_socket->wq.fasync_list?
Looks like I missed something, could you please point it to me?

Thanks!

> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
> ---
>   net/smc/af_smc.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 0f53a5c6fd9d..68cde9db5d2f 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -337,9 +337,13 @@ static int smc_release(struct socket *sock)
>   	else
>   		lock_sock(sk);
>   
> -	if (old_state == SMC_INIT && sk->sk_state == SMC_ACTIVE &&
> -	    !smc->use_fallback)
> +	if (smc->use_fallback) {
> +		/* FIXME: ugly and should be done in some other way */
> +		if (sock->wq.fasync_list)
> +			fasync_remove_entry(sock->file, &sock->wq.fasync_list);
> +	} else if (old_state == SMC_INIT && sk->sk_state == SMC_ACTIVE) {
>   		smc_close_active_abort(smc);
> +	}
>   
>   	rc = __smc_release(smc);
>
Antipov, Dmitriy Feb. 21, 2024, 3:02 p.m. UTC | #2
On Wed, 2024-02-21 at 21:09 +0800, Wen Gu wrote:

> 1. on = 1; ioctl(sock, FIOASYNC, &on), a fasync entry is added to
>     smc->sk.sk_socket->wq.fasync_list;
> 
> 2. Then fallback happend, and swapped the socket:
>     smc->clcsock->file = smc->sk.sk_socket->file;
>     smc->clcsock->file->private_data = smc->clcsock;
>     smc->clcsock->wq.fasync_list = smc->sk.sk_socket->wq.fasync_list;
>     smc->sk.sk_socket->wq.fasync_list = NULL;
> 
> 3. on = 0; ioctl(sock, FIOASYNC, &on), the fasync entry is removed
>     from smc->clcsock->wq.fasync_list,
> (Is there a race between 2 and 3 ?)

1) IIUC yes. The following sequence from smc_switch_to_fallback():

smc->clcsock->file = smc->sk.sk_socket->file;
smc->clcsock->file->private_data = smc->clcsock;

is executed with locked smc->sk.sk_socket but unlocked smc->clcsock.
This way,

struct socket *sock = filp->private_data;

in sock_fasync() introduces an undefined behavior (because an
actual value of 'private_data' is unpredictable). So there are
two possible scenarios. When

on = 1; ioctl(sock, FIOASYNC, &on);
on = 0; ioctl(sock, FIOASYNC, &on);

actually modifies fasync list of the same socket, this works as
expected. If kernel sockets behind 'sock' gets swapped between
calls to ioctl(), fasync list of the first socket has an entry
which can't be removed by the second ioctl(). 


> 4. Then close the file, __fput() calls file->f_op->fasync(-1, file, 0),
>     then sock_fasync() calls fasync_helper(fd, filp, on, &wq->fasync_list)
>     and fasync_remove_entry() removes entries in smc->clcsock->wq.fasync_list.
>     Now smc->clcsock->wq.fasync_list is empty.

2) No. In the second (bad) scenario from the above, an attempt to remove
fasync entry from smc->clcsock->wq.fasync_list always fails because
fasync entry was actually linked to smc->sk.sk_socket->wq.fasync_list.

Note sock_fasync() doesn't check the value returned from fasync_helper().
How dumb.

> 5. __fput() calls file->f_op->release(inode, file), then sock_close calls
>     __sock_release, then ops->release calls smc_release(), and __smc_release()
>     calls smc_restore_fallback_changes() to restore socket:
>     if (smc->clcsock->file) { /* non-accepted sockets have no file yet */
>          smc->clcsock->file->private_data = smc->sk.sk_socket;
>          smc->clcsock->file = NULL;
>          smc_fback_restore_callbacks(smc);
>     }

3) Yes. And it's too late because __fput() calls file->f_op->fasync(-1, ...,
0) _before_ file->f_op->release(). So even if you have sockets swapped back,
no one will take care about freeing fasync list.


> 6. Then back to __sock_release, check if sock->wq.fasync_list (that is
>     smc->sk.sk_socket->wq.fasync_list) is empty and it is empty.

4) No. It's not empty because an attempt to remove fasync entry has failed
at [2] just because it was made against the wrong socket. 


For your convenience, there is a human-readable reproducer loosely modeled
around the one generated by syzkaller. You can try it on any system running
recently enough kernel with CONFIG_SMC enabled (root is not required), and
receiving a few (or may be a lot of) "__sock_release: fasync list not empty"
messages clearly indicates an issue. NOTE: this shouldn't crash the system
and/or make it unusable, but remember that each message comes with a small
kernel memory leak.

Dmitry

#include <signal.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

int sock;

void *
loop0 (void *arg)
{
  struct msghdr msg = { 0 };

  while (1)
    {
      sock = socket (AF_SMC, SOCK_STREAM, 0);
      sendmsg (sock, &msg, MSG_FASTOPEN);
      close (sock);
    }
  return NULL;
}

void *
loop1 (void *arg)
{
  int on;

  while (1)
    {
      on = 1;
      ioctl (sock, FIOASYNC, &on);
      on = 0;
      ioctl (sock, FIOASYNC, &on);
    }

  return NULL;
}

int
main (int argc, char *argv[])
{
  pthread_t a, b;
  struct sigaction sa = { 0 };

  sa.sa_handler = SIG_IGN;
  sigaction (SIGIO, &sa, NULL);

  pthread_create (&a, NULL, loop0, NULL);
  pthread_create (&b, NULL, loop1, NULL);

  pause ();

  pthread_join (a, NULL);
  pthread_join (b, NULL);

  return 0;
}
Wen Gu Feb. 23, 2024, 3:36 a.m. UTC | #3
On 2024/2/21 23:02, Antipov, Dmitriy wrote:
> On Wed, 2024-02-21 at 21:09 +0800, Wen Gu wrote:
> 
>> 1. on = 1; ioctl(sock, FIOASYNC, &on), a fasync entry is added to
>>      smc->sk.sk_socket->wq.fasync_list;
>>
>> 2. Then fallback happend, and swapped the socket:
>>      smc->clcsock->file = smc->sk.sk_socket->file;
>>      smc->clcsock->file->private_data = smc->clcsock;
>>      smc->clcsock->wq.fasync_list = smc->sk.sk_socket->wq.fasync_list;
>>      smc->sk.sk_socket->wq.fasync_list = NULL;
>>
>> 3. on = 0; ioctl(sock, FIOASYNC, &on), the fasync entry is removed
>>      from smc->clcsock->wq.fasync_list,
>> (Is there a race between 2 and 3 ?)
> 
> 1) IIUC yes. The following sequence from smc_switch_to_fallback():
> 
> smc->clcsock->file = smc->sk.sk_socket->file;
> smc->clcsock->file->private_data = smc->clcsock;
> 
> is executed with locked smc->sk.sk_socket but unlocked smc->clcsock.
> This way,
> 
> struct socket *sock = filp->private_data;
> 
> in sock_fasync() introduces an undefined behavior (because an
> actual value of 'private_data' is unpredictable). So there are
> two possible scenarios. When
> 
> on = 1; ioctl(sock, FIOASYNC, &on);
> on = 0; ioctl(sock, FIOASYNC, &on);
> 
> actually modifies fasync list of the same socket, this works as
> expected. If kernel sockets behind 'sock' gets swapped between
> calls to ioctl(), fasync list of the first socket has an entry
> which can't be removed by the second ioctl().
> 

Thank you for the explanation, Dmitriy.

So the race could be:

sock_fasync                         | smc_switch_to_fallback
------------------------------------------------------------------
/* smc->sk.sk_socket->wq.fasync_list|
is empty at the beginning */        |
                                     |
/* attempt to add fasync_struct */  |
sock = filp->private_data;          |
(smc->sk.sk_socket)                 | /* fallback happens */
                                     | lock_sock(sk);
                                     | file->private_data = smc->clcsock
                                     | smc->clcsock->wq.fasync_list = smc->sk.sk_socket->wq.fasync_list
                                     | smc->sk.sk_socket->wq.fasync_list = NULL
                                     | release_sock(sk);
lock_sock(sk);                      |
fasync_helper(on=1)                 |
(successed to add the entry in      |
smc->sk.sk_socket->wq.fasync_list)  |
release_sock(sk);                   |
                                     |
/* the fasync_struct entry can't be |
removed later, since                |
file->private_data has been changed |
to clcsock */                       |

Now clcsock->wq.fasync_list is empty and the fasync_struct entry is
always left in smc->sk.sk_socket->wq.fasync_list.

If we only remove fasync_struct entries in smc->sk.sk_socket->wq.fasync_list
during smc_release, it indeed solves the leak, but it fails to address
the problem of fasync_struct entries being incorrectly inserted into the
old socket (they should have been placed in smc->clcsock->wq.fasync_list
if fallback happens).

One solution to this issue I can think of is to check whether
filp->private_data has been changed when the sock_fasync holds the sock lock,
but it inevitably changes the general code..

diff --git a/net/socket.c b/net/socket.c
index ed3df2f749bf..a28435195854 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1443,6 +1443,11 @@ static int sock_fasync(int fd, struct file *filp, int on)
                 return -EINVAL;

         lock_sock(sk);
+       /* filp->private_data has changed */
+       if (on && unlikely(sock != filp->private_data)) {
+               release_sock(sk);
+               return -EAGAIN;
+       }
         fasync_helper(fd, filp, on, &wq->fasync_list);

         if (!wq->fasync_list)

Let's see if anyone else has a better idea.

Best regards,
Wen Gu

> 
>> 4. Then close the file, __fput() calls file->f_op->fasync(-1, file, 0),
>>      then sock_fasync() calls fasync_helper(fd, filp, on, &wq->fasync_list)
>>      and fasync_remove_entry() removes entries in smc->clcsock->wq.fasync_list.
>>      Now smc->clcsock->wq.fasync_list is empty.
> 
> 2) No. In the second (bad) scenario from the above, an attempt to remove
> fasync entry from smc->clcsock->wq.fasync_list always fails because
> fasync entry was actually linked to smc->sk.sk_socket->wq.fasync_list.
> 
> Note sock_fasync() doesn't check the value returned from fasync_helper().
> How dumb.
> 
>> 5. __fput() calls file->f_op->release(inode, file), then sock_close calls
>>      __sock_release, then ops->release calls smc_release(), and __smc_release()
>>      calls smc_restore_fallback_changes() to restore socket:
>>      if (smc->clcsock->file) { /* non-accepted sockets have no file yet */
>>           smc->clcsock->file->private_data = smc->sk.sk_socket;
>>           smc->clcsock->file = NULL;
>>           smc_fback_restore_callbacks(smc);
>>      }
> 
> 3) Yes. And it's too late because __fput() calls file->f_op->fasync(-1, ...,
> 0) _before_ file->f_op->release(). So even if you have sockets swapped back,
> no one will take care about freeing fasync list.
> 
> 
>> 6. Then back to __sock_release, check if sock->wq.fasync_list (that is
>>      smc->sk.sk_socket->wq.fasync_list) is empty and it is empty.
> 
> 4) No. It's not empty because an attempt to remove fasync entry has failed
> at [2] just because it was made against the wrong socket.
> 
> 
> For your convenience, there is a human-readable reproducer loosely modeled
> around the one generated by syzkaller. You can try it on any system running
> recently enough kernel with CONFIG_SMC enabled (root is not required), and
> receiving a few (or may be a lot of) "__sock_release: fasync list not empty"
> messages clearly indicates an issue. NOTE: this shouldn't crash the system
> and/or make it unusable, but remember that each message comes with a small
> kernel memory leak.
> 
> Dmitry
> 
> #include <signal.h>
> #include <unistd.h>
> #include <pthread.h>
> #include <sys/ioctl.h>
> #include <sys/socket.h>
> 
> int sock;
> 
> void *
> loop0 (void *arg)
> {
>    struct msghdr msg = { 0 };
> 
>    while (1)
>      {
>        sock = socket (AF_SMC, SOCK_STREAM, 0);
>        sendmsg (sock, &msg, MSG_FASTOPEN);
>        close (sock);
>      }
>    return NULL;
> }
> 
> void *
> loop1 (void *arg)
> {
>    int on;
> 
>    while (1)
>      {
>        on = 1;
>        ioctl (sock, FIOASYNC, &on);
>        on = 0;
>        ioctl (sock, FIOASYNC, &on);
>      }
> 
>    return NULL;
> }
> 
> int
> main (int argc, char *argv[])
> {
>    pthread_t a, b;
>    struct sigaction sa = { 0 };
> 
>    sa.sa_handler = SIG_IGN;
>    sigaction (SIGIO, &sa, NULL);
> 
>    pthread_create (&a, NULL, loop0, NULL);
>    pthread_create (&b, NULL, loop1, NULL);
> 
>    pause ();
> 
>    pthread_join (a, NULL);
>    pthread_join (b, NULL);
> 
>    return 0;
> }
> 
>
Dmitry Antipov March 4, 2024, 4:35 p.m. UTC | #4
On 2/23/24 06:36, Wen Gu wrote:

> One solution to this issue I can think of is to check whether
> filp->private_data has been changed when the sock_fasync holds the sock lock,
> but it inevitably changes the general code..
> 
> diff --git a/net/socket.c b/net/socket.c
> index ed3df2f749bf..a28435195854 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -1443,6 +1443,11 @@ static int sock_fasync(int fd, struct file *filp, int on)
>                  return -EINVAL;
> 
>          lock_sock(sk);
> +       /* filp->private_data has changed */
> +       if (on && unlikely(sock != filp->private_data)) {
> +               release_sock(sk);
> +               return -EAGAIN;
> +       }
>          fasync_helper(fd, filp, on, &wq->fasync_list);
> 
>          if (!wq->fasync_list)
> 
> Let's see if anyone else has a better idea.

IIUC this is not a solution just because it decreases the probability of the race
but doesn't eliminate it completely - an underlying socket switch (e.g. changing
'filp->private_data') may happen when 'fasync_helper()' is in progress.

Dmitry
Wen Gu March 6, 2024, 2:45 p.m. UTC | #5
On 2024/3/5 00:35, Dmitry Antipov wrote:
> On 2/23/24 06:36, Wen Gu wrote:
> 
>> One solution to this issue I can think of is to check whether
>> filp->private_data has been changed when the sock_fasync holds the sock lock,
>> but it inevitably changes the general code..
>>
>> diff --git a/net/socket.c b/net/socket.c
>> index ed3df2f749bf..a28435195854 100644
>> --- a/net/socket.c
>> +++ b/net/socket.c
>> @@ -1443,6 +1443,11 @@ static int sock_fasync(int fd, struct file *filp, int on)
>>                  return -EINVAL;
>>
>>          lock_sock(sk);
>> +       /* filp->private_data has changed */
>> +       if (on && unlikely(sock != filp->private_data)) {
>> +               release_sock(sk);
>> +               return -EAGAIN;
>> +       }
>>          fasync_helper(fd, filp, on, &wq->fasync_list);
>>
>>          if (!wq->fasync_list)
>>
>> Let's see if anyone else has a better idea.
> 
> IIUC this is not a solution just because it decreases the probability of the race
> but doesn't eliminate it completely - an underlying socket switch (e.g. changing
> 'filp->private_data') may happen when 'fasync_helper()' is in progress.
> 
Hi Dmitry,

IIUC, the fallback (or more precisely the private_data change) essentially
always happens when the lock_sock(smc->sk) is held, except in smc_listen_work()
or smc_listen_decline(), but at that moment, userspace program can not yet
acquire this new socket to add fasync entries to the fasync_list.

So IMHO, the above patch should work, since it checks the private_data under
the lock_sock(sk). But if I missed something, please correct me.

And I wonder if you can still see the leak with the patch above through
your reproducer or syzbot's reproducer? I once ran your reproducer for about
50 mins and didn't see the leak.

Thanks!

> Dmitry
> 
>
Dmitry Antipov March 6, 2024, 6:07 p.m. UTC | #6
On 3/6/24 17:45, Wen Gu wrote:

> IIUC, the fallback (or more precisely the private_data change) essentially
> always happens when the lock_sock(smc->sk) is held, except in smc_listen_work()
> or smc_listen_decline(), but at that moment, userspace program can not yet
> acquire this new socket to add fasync entries to the fasync_list.
> 
> So IMHO, the above patch should work, since it checks the private_data under
> the lock_sock(sk). But if I missed something, please correct me.

Well, the whole picture is somewhat more complicated. Consider the
following diagram (an underlying kernel socket is in [], e.g. [smc->sk]):

Thread 0                        Thread 1

ioctl(sock, FIOASYNC, [1])
...
sock = filp->private_data;
lock_sock(sock [smc->sk]);
sock_fasync(sock, ..., 1)       ; new fasync_struct linked to smc->sk
release_sock(sock [smc->sk]);
                                 ...
                                 lock_sock([smc->sk]);
                                 ...
                                 smc_switch_to_fallback()
                                 ...
                                 smc->clcsock->file->private_data = smc->clcsock;
                                 ...
                                 release_sock([smc->sk]);
ioctl(sock, FIOASYNC, [0])
...
sock = filp->private_data;
lock_sock(sock [smc->clcsock]);
sock_fasync(sock, ..., 0)       ; nothing to unlink from smc->clcsock
                                 ; since fasync entry was linked to smc->sk
release_sock(sock [smc->clcsock]);
                                 ...
                                 close(sock [smc->clcsock]);
                                 __fput(...);
                                 file->f_op->fasync(sock, [0])   ; always failed -
                                                                 ; should use
                                                                 ; smc->sk instead
                                 file->f_op->release()
                                    ...
                                    smc_restore_fallback_changes()
                                    ...
                                    file->private_data = smc->sk.sk_socket;

That is, smc_restore_fallback_changes() restores filp->private_data to
smc->sk. If __fput() would have called file->f_op->release() _before_
file->f_op->fasync(), the fix would be as simple as adding

smc->sk.sk_socket->wq.fasync_list = smc->clcsock->wq.fasync_list;

to smc_restore_fallback_changes(). But since file->f_op->fasync() is called
before file->f_op->release(), the former always makes an attempt to unlink fasync
entry from smc->clcsock instead of smc->sk, thus introducing the memory leak.

And an idea with shared wait queue was intended in attempt to eliminate
this chicken-egg lookalike problem completely.

Dmitry
Jan Karcher March 7, 2024, 8:58 a.m. UTC | #7
On 06/03/2024 19:07, Dmitry Antipov wrote:
> 
> Well, the whole picture is somewhat more complicated. Consider the
> following diagram (an underlying kernel socket is in [], e.g. [smc->sk]):
> 
> Thread 0                        Thread 1
> 
> ioctl(sock, FIOASYNC, [1])
> ...
> sock = filp->private_data;
> lock_sock(sock [smc->sk]);
> sock_fasync(sock, ..., 1)       ; new fasync_struct linked to smc->sk
> release_sock(sock [smc->sk]);
>                                  ...
>                                  lock_sock([smc->sk]);
>                                  ...
>                                  smc_switch_to_fallback()
>                                  ...
>                                  smc->clcsock->file->private_data = 
> smc->clcsock;
>                                  ...
>                                  release_sock([smc->sk]);
> ioctl(sock, FIOASYNC, [0])
> ...
> sock = filp->private_data;
> lock_sock(sock [smc->clcsock]);
> sock_fasync(sock, ..., 0)       ; nothing to unlink from smc->clcsock
>                                  ; since fasync entry was linked to smc->sk
> release_sock(sock [smc->clcsock]);
>                                  ...
>                                  close(sock [smc->clcsock]);
>                                  __fput(...);
>                                  file->f_op->fasync(sock, [0])   ; 
> always failed -
>                                                                  ; 
> should use
>                                                                  ; 
> smc->sk instead
>                                  file->f_op->release()
>                                     ...
>                                     smc_restore_fallback_changes()
>                                     ...
>                                     file->private_data = smc->sk.sk_socket;

Thank you Dmitry for your detailed explanations.
It helped me a lot trying to understand the problem.
And I'm still in the progress of trying to understand it.
 From my naive point of view:
Would it help if we catch the ioctl and check if there is an active 
fallback and either stall or route the ioctl to the correct socket?
I've seen that there is an ioctl function handle in the proto_ops.
So on a very abstract level we could do the following:
1. Indicate an active Fallback at the start of the fallback to tcp.
2. Catch ioctls.
3. Check if there is an active fallback.
	NO: Pass the ioctl.
	YES: Wait for the fallback to complete and pass after.

If this blocks too much we can obviously do some finer checks there.
E.g.: Just check if the private data is already at attached to the socket.

Do you think this would be a suiteable solution?
I'm going to look into your proposal Dmitry to see how you solved the 
problem and to understand it better.

Thanks
- Jan

> 
> That is, smc_restore_fallback_changes() restores filp->private_data to
> smc->sk. If __fput() would have called file->f_op->release() _before_
> file->f_op->fasync(), the fix would be as simple as adding
> 
> smc->sk.sk_socket->wq.fasync_list = smc->clcsock->wq.fasync_list;
> 
> to smc_restore_fallback_changes(). But since file->f_op->fasync() is called
> before file->f_op->release(), the former always makes an attempt to 
> unlink fasync
> entry from smc->clcsock instead of smc->sk, thus introducing the memory 
> leak.
> 
> And an idea with shared wait queue was intended in attempt to eliminate
> this chicken-egg lookalike problem completely.
> 
> Dmitry
>
Jan Karcher March 7, 2024, 9:57 a.m. UTC | #8
On 06/03/2024 19:07, Dmitry Antipov wrote:
> On 3/6/24 17:45, Wen Gu wrote:
> 
>> IIUC, the fallback (or more precisely the private_data change) 
>> essentially
>> always happens when the lock_sock(smc->sk) is held, except in 
>> smc_listen_work()
>> or smc_listen_decline(), but at that moment, userspace program can not 
>> yet
>> acquire this new socket to add fasync entries to the fasync_list.
>>
>> So IMHO, the above patch should work, since it checks the private_data 
>> under
>> the lock_sock(sk). But if I missed something, please correct me.
> 
> Well, the whole picture is somewhat more complicated. Consider the
> following diagram (an underlying kernel socket is in [], e.g. [smc->sk]):
> 
> Thread 0                        Thread 1
> 
> ioctl(sock, FIOASYNC, [1])
> ...
> sock = filp->private_data;
> lock_sock(sock [smc->sk]);
> sock_fasync(sock, ..., 1)       ; new fasync_struct linked to smc->sk
> release_sock(sock [smc->sk]);
>                                  ...
>                                  lock_sock([smc->sk]);
>                                  ...
>                                  smc_switch_to_fallback()
>                                  ...
>                                  smc->clcsock->file->private_data = 
> smc->clcsock;
>                                  ...
>                                  release_sock([smc->sk]);
> ioctl(sock, FIOASYNC, [0])
> ...
> sock = filp->private_data;
> lock_sock(sock [smc->clcsock]);
> sock_fasync(sock, ..., 0)       ; nothing to unlink from smc->clcsock
>                                  ; since fasync entry was linked to smc->sk
> release_sock(sock [smc->clcsock]);
>                                  ...
>                                  close(sock [smc->clcsock]);
>                                  __fput(...);
>                                  file->f_op->fasync(sock, [0])   ; 
> always failed -
>                                                                  ; 
> should use
>                                                                  ; 
> smc->sk instead
>                                  file->f_op->release()
>                                     ...
>                                     smc_restore_fallback_changes()
>                                     ...
>                                     file->private_data = smc->sk.sk_socket;
> 
> That is, smc_restore_fallback_changes() restores filp->private_data to
> smc->sk. If __fput() would have called file->f_op->release() _before_
> file->f_op->fasync(), the fix would be as simple as adding
> 
> smc->sk.sk_socket->wq.fasync_list = smc->clcsock->wq.fasync_list;
> 
> to smc_restore_fallback_changes(). But since file->f_op->fasync() is called
> before file->f_op->release(), the former always makes an attempt to 
> unlink fasync
> entry from smc->clcsock instead of smc->sk, thus introducing the memory 
> leak.
> 
> And an idea with shared wait queue was intended in attempt to eliminate
> this chicken-egg lookalike problem completely.
> 
> Dmitry
> 

Me and Gerd had another look at this.
The infrastructure for what i proposed in the last E-Mail regarding the 
ioctl function handler is already there (af_smc.c#smc_ioctl).
There we already check if we are in a active fallback to send the ioctls 
to the clcsock instead of the sk socket.

```
	lock_sock(&smc->sk);
	if (smc->use_fallback) {
		if (!smc->clcsock) {
			release_sock(&smc->sk);
			return -EBADF;
		}
		answ = smc->clcsock->ops->ioctl(smc->clcsock, cmd, arg);
		release_sock(&smc->sk);
		return answ;
	}
```

We think it might be an option to secure the path in this function with 
the smc->clcsock_release_lock.

```
	lock_sock(&smc->sk);
	if (smc->use_fallback) {
		if (!smc->clcsock) {
			release_sock(&smc->sk);
			return -EBADF;
		}
+		mutex_lock(&smc->clcsock_release_lock);
		answ = smc->clcsock->ops->ioctl(smc->clcsock, cmd, arg);
+		mutex_unlock(&smc->clcsock_release_lock);
		release_sock(&smc->sk);
		return answ;
	}
```

What do yo think about this?
I'm going to test this idea and see if we canget rid of the leak this way.

Thanks
- Jan & Gerd
Antipov, Dmitriy March 7, 2024, 10:21 a.m. UTC | #9
On Thu, 2024-03-07 at 10:57 +0100, Jan Karcher wrote:

> We think it might be an option to secure the path in this function with 
> the smc->clcsock_release_lock.
> 
> ```
> 	lock_sock(&smc->sk);
> 	if (smc->use_fallback) {
> 		if (!smc->clcsock) {
> 			release_sock(&smc->sk);
> 			return -EBADF;
> 		}
> +		mutex_lock(&smc->clcsock_release_lock);
> 		answ = smc->clcsock->ops->ioctl(smc->clcsock, cmd, arg);
> +		mutex_unlock(&smc->clcsock_release_lock);
> 		release_sock(&smc->sk);
> 		return answ;
> 	}
> ```
> 
> What do yo think about this?

You're trying to fix it on the wrong path. FIOASYNC is a generic rather
than protocol-specific thing. So userspace 'ioctl(sock, FIOASYNC, [])'
call is handled with:

-> sys_ioctl()
  -> do_vfs_ioctl()
    -> ioctl_fioasync()
      -> filp->f_op->fasync() (which is sock_fasync() for all sockets)

rather than 'sock->ops->ioctl(...)'.

Dmitry
Wen Gu March 7, 2024, 1:53 p.m. UTC | #10
On 2024/3/7 02:07, Dmitry Antipov wrote:
> On 3/6/24 17:45, Wen Gu wrote:
> 
>> IIUC, the fallback (or more precisely the private_data change) essentially
>> always happens when the lock_sock(smc->sk) is held, except in smc_listen_work()
>> or smc_listen_decline(), but at that moment, userspace program can not yet
>> acquire this new socket to add fasync entries to the fasync_list.
>>
>> So IMHO, the above patch should work, since it checks the private_data under
>> the lock_sock(sk). But if I missed something, please correct me.
> 
> Well, the whole picture is somewhat more complicated. Consider the
> following diagram (an underlying kernel socket is in [], e.g. [smc->sk]):
> 
> Thread 0                        Thread 1
> 
> ioctl(sock, FIOASYNC, [1])
> ...
> sock = filp->private_data;
> lock_sock(sock [smc->sk]);
> sock_fasync(sock, ..., 1)       ; new fasync_struct linked to smc->sk
> release_sock(sock [smc->sk]);
>                                  ...
>                                  lock_sock([smc->sk]);
>                                  ...
>                                  smc_switch_to_fallback()
>                                  ...
>                                  smc->clcsock->file->private_data = smc->clcsock;
>                                  ...
>                                  release_sock([smc->sk]);
> ioctl(sock, FIOASYNC, [0])
> ...
> sock = filp->private_data;
> lock_sock(sock [smc->clcsock]);
> sock_fasync(sock, ..., 0)       ; nothing to unlink from smc->clcsock
>                                  ; since fasync entry was linked to smc->sk
> release_sock(sock [smc->clcsock]);


I don't understand why the fasync entry now can't be removed from
clcsock->wq.fasync_list? since the fasync entry has been moved to
clcsock->wq.fasync_list during fallback.

The process you described above is:

1) An fasync entry was added into smc->sk.sk_socket->wq.fasync_list;
2) then fallback occurs, and the fasync entry is moved to clcsock->wq.fasync_list,
    and file->private_data is changed to smc->clcsock;
3) lastly we removed the fasync entry from clcsock->wq.fasync_list.


It can be reproduced by follows, right?

#include <signal.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
         struct msghdr msg = { 0 };
         int sock;
         int on;

         sock = socket(AF_SMC, SOCK_STREAM, 0);

         /* add fasync entry */
         on = 1;
         ioctl(sock, FIOASYNC, &on);

         /* fallback */
         sendmsg(sock, &msg, MSG_FASTOPEN);

         /* remove fasync entry */
         on = 0;
         ioctl(sock, FIOASYNC, &on);

         close(sock);
         return 0;
}

and I added some prints in the kernel for quick debug:

diff --git a/fs/fcntl.c b/fs/fcntl.c
index c80a6acad742..79b8e435c380 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -880,6 +880,7 @@ int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
                 call_rcu(&fa->fa_rcu, fasync_free_rcu);
                 filp->f_flags &= ~FASYNC;
                 result = 1;
+               pr_warn("%s: wq->fasync_list %pK, fasync entry %pK\n", __func__, &(*fapp), fa);
                 break;
         }
         spin_unlock(&fasync_lock);
@@ -932,6 +933,7 @@ struct fasync_struct *fasync_insert_entry(int fd, struct file *filp, struct fasy
         new->fa_next = *fapp;
         rcu_assign_pointer(*fapp, new);
         filp->f_flags |= FASYNC;
+       pr_warn("%s: wq->fasync_list %pK, fasync entry %pK\n", __func__, &(*fapp), new);

  out:
         spin_unlock(&fasync_lock);
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 4b52b3b159c0..3b9737d42dbd 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -925,6 +925,9 @@ static int smc_switch_to_fallback(struct smc_sock *smc, int reason_code)
                 smc->clcsock->wq.fasync_list =
                         smc->sk.sk_socket->wq.fasync_list;
                 smc->sk.sk_socket->wq.fasync_list = NULL;
+               pr_warn("%s: smc->sk wq.fasync_list %pK, clcsock->wq.fasync_list %pK\n",
+                       __func__, &smc->sk.sk_socket->wq.fasync_list,
+                       &smc->clcsock->wq.fasync_list);

                 /* There might be some wait entries remaining
                  * in smc sk->sk_wq and they should be woken up



ran the reproducer, and dmesg shows:

[] fasync_insert_entry: wq->fasync_list ffff96fdc0425e98, fasync entry ffff96fdccc62690
[] smc: smc_switch_to_fallback: smc->sk wq.fasync_list ffff96fdc0425e98, clcsock->wq.fasync_list ffff96fdc0426ed8
[] fasync_remove_entry: wq->fasync_list ffff96fdc0426ed8, fasync entry ffff96fdccc62690

It shows that
1. an fasync entry ffff96fdccc62690 is added into ffff96fdc0425e98 (smc->sk wq.fasync_list)
2. then fallback, all the fasync entris in smc->sk wq.fasync_list will be moved to clcsock->wq.fasync_list.
3. then the fasync entry ffff96fdccc62690 (the one in #1) is removed from ffff96fdc0426ed8 (clcsock->wq.fasync_list)

What's wrong with this?




In fact, I think what does cause this leak (maybe one of causes) is the race
I discribed through the diagram in
https://lore.kernel.org/netdev/19d7d71b-c911-45cc-9671-235d98720be6@linux.alibaba.com/

1) sock_fasync() got the filp->private_data->wq (that is smc->sk.sk_socket->wq)
    and record it in 'wq';
2) meanwhile, fallback occurs and filp->private_data changed, and from now on,
    user can only operate the clcsock based on file->private_data;
3) (race here) and sock_fasync() keep going and add an entry to 'wq'->fasync_list
    (that is smc->sk.sk_socket->wq); This fasync entry is the one that we can't
    removed later, since we start to operate clcsock after fallback.


Thanks!

>                                  ...
>                                  close(sock [smc->clcsock]);
>                                  __fput(...);
>                                  file->f_op->fasync(sock, [0])   ; always failed -
>                                                                  ; should use
>                                                                  ; smc->sk instead
>                                  file->f_op->release()
>                                     ...
>                                     smc_restore_fallback_changes()
>                                     ...
>                                     file->private_data = smc->sk.sk_socket;
> 
> That is, smc_restore_fallback_changes() restores filp->private_data to
> smc->sk. If __fput() would have called file->f_op->release() _before_
> file->f_op->fasync(), the fix would be as simple as adding
> 
> smc->sk.sk_socket->wq.fasync_list = smc->clcsock->wq.fasync_list;
> 
> to smc_restore_fallback_changes(). But since file->f_op->fasync() is called
> before file->f_op->release(), the former always makes an attempt to unlink fasync
> entry from smc->clcsock instead of smc->sk, thus introducing the memory leak.
> 
> And an idea with shared wait queue was intended in attempt to eliminate
> this chicken-egg lookalike problem completely.
> 
> Dmitry
Antipov, Dmitriy March 26, 2024, 8:18 a.m. UTC | #11
On Thu, 2024-03-07 at 13:21 +0300, Dmitry Antipov wrote:

> On Thu, 2024-03-07 at 10:57 +0100, Jan Karcher wrote:
> 
> > We think it might be an option to secure the path in this function with 
> > the smc->clcsock_release_lock.
> > 
> > ```
> > 	lock_sock(&smc->sk);
> > 	if (smc->use_fallback) {
> > 		if (!smc->clcsock) {
> > 			release_sock(&smc->sk);
> > 			return -EBADF;
> > 		}
> > +		mutex_lock(&smc->clcsock_release_lock);
> > 		answ = smc->clcsock->ops->ioctl(smc->clcsock, cmd, arg);
> > +		mutex_unlock(&smc->clcsock_release_lock);
> > 		release_sock(&smc->sk);
> > 		return answ;
> > 	}
> > ```
> > 
> > What do yo think about this?
> 
> You're trying to fix it on the wrong path. FIOASYNC is a generic rather
> than protocol-specific thing. So userspace 'ioctl(sock, FIOASYNC, [])'
> call is handled with:
> 
> -> sys_ioctl()
>   -> do_vfs_ioctl()
>     -> ioctl_fioasync()
>       -> filp->f_op->fasync() (which is sock_fasync() for all sockets)
> 
> rather than 'sock->ops->ioctl(...)'.

Any progress on this?

Dmitry
Wen Gu March 27, 2024, 6:12 a.m. UTC | #12
On 2024/3/26 16:18, Antipov, Dmitriy wrote:
> On Thu, 2024-03-07 at 13:21 +0300, Dmitry Antipov wrote:
> 
>> On Thu, 2024-03-07 at 10:57 +0100, Jan Karcher wrote:
>>
>>> We think it might be an option to secure the path in this function with
>>> the smc->clcsock_release_lock.
>>>
>>> ```
>>> 	lock_sock(&smc->sk);
>>> 	if (smc->use_fallback) {
>>> 		if (!smc->clcsock) {
>>> 			release_sock(&smc->sk);
>>> 			return -EBADF;
>>> 		}
>>> +		mutex_lock(&smc->clcsock_release_lock);
>>> 		answ = smc->clcsock->ops->ioctl(smc->clcsock, cmd, arg);
>>> +		mutex_unlock(&smc->clcsock_release_lock);
>>> 		release_sock(&smc->sk);
>>> 		return answ;
>>> 	}
>>> ```
>>>
>>> What do yo think about this?
>>
>> You're trying to fix it on the wrong path. FIOASYNC is a generic rather
>> than protocol-specific thing. So userspace 'ioctl(sock, FIOASYNC, [])'
>> call is handled with:
>>
>> -> sys_ioctl()
>>    -> do_vfs_ioctl()
>>      -> ioctl_fioasync()
>>        -> filp->f_op->fasync() (which is sock_fasync() for all sockets)
>>
>> rather than 'sock->ops->ioctl(...)'.
> 
> Any progress on this?

Hi Dmitry,

In my opinion, first we need to figure out what the root cause(race) of this leak is.
I am not very convinced about your analysis[1] and gave some my thoughts about it[2].
I would appreciate if you give your response about that to make this issue clearer and
get everyone on the same page (including SMC maintainers). Then we can see if your other
proposal[3] is a proper solution to the issue or if anyone has a better idea.

[1] https://lore.kernel.org/netdev/35584a9f-f4c2-423a-8bb8-2c729cedb6fe@yandex.ru/
[2] https://lore.kernel.org/netdev/a88a0731-6cbe-4987-b1e9-afa51f9ab057@linux.alibaba.com/
[3] https://lore.kernel.org/netdev/625c9519-7ae6-43a3-a5d0-81164ad7fd0e@yandex.ru/

Thanks.

> 
> Dmitry
>
diff mbox series

Patch

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 0f53a5c6fd9d..68cde9db5d2f 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -337,9 +337,13 @@  static int smc_release(struct socket *sock)
 	else
 		lock_sock(sk);
 
-	if (old_state == SMC_INIT && sk->sk_state == SMC_ACTIVE &&
-	    !smc->use_fallback)
+	if (smc->use_fallback) {
+		/* FIXME: ugly and should be done in some other way */
+		if (sock->wq.fasync_list)
+			fasync_remove_entry(sock->file, &sock->wq.fasync_list);
+	} else if (old_state == SMC_INIT && sk->sk_state == SMC_ACTIVE) {
 		smc_close_active_abort(smc);
+	}
 
 	rc = __smc_release(smc);