[RFC,v1,0/3] virtio/vsock: use SO_RCVLOWAT to set POLLIN/POLLRDNORM

Message ID c8de13b1-cbd8-e3e0-5728-f3c3648c69f7@sberdevices.ru (mailing list archive)

Arseniy Krasnov July 18, 2022, 8:12 a.m. UTC
Hello,

during my experiments with zerocopy receive, I found that in some
cases the poll() implementation violates POSIX: when a socket has a
non-default SO_RCVLOWAT (i.e. not 1), poll() always sets the POLLIN
and POLLRDNORM bits in 'revents', even if the number of bytes
available to read on the socket is smaller than the SO_RCVLOWAT
value. In this case the user sees the POLLIN flag and then tries to
read data (for example using a 'read()' call), but the read call
blocks, because the SO_RCVLOWAT logic is implemented in the dequeue
loop in af_vsock.c. At the same time, POSIX requires that:

"POLLIN     Data other than high-priority data may be read without
            blocking.
 POLLRDNORM Normal data may be read without blocking."

See https://www.open-std.org/jtc1/sc22/open/n4217.pdf, page 293.

So the poll() syscall returns POLLIN, but the subsequent read call
blocks.
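
For illustration, the application-side pattern described above can be sketched roughly as follows. This is a minimal userspace sketch, not code from the series: the helper names set_rcvlowat(), get_rcvlowat() and readable_now() are made up here, and they are meant to work on any connected stream socket descriptor.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical helper: set SO_RCVLOWAT on 'fd'. Returns 0 or -1. */
static int set_rcvlowat(int fd, int lowat)
{
	return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat));
}

/* Hypothetical helper: read SO_RCVLOWAT back. Returns the value or -1. */
static int get_rcvlowat(int fd)
{
	int lowat = 0;
	socklen_t len = sizeof(lowat);

	if (getsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, &len) < 0)
		return -1;

	/* The option is a plain int, so the returned length must match. */
	assert(len == sizeof(lowat));
	return lowat;
}

/*
 * Hypothetical helper: poll 'fd' with a zero timeout.
 * Returns 1 if POLLIN is reported, 0 if not, -1 on error.
 */
static int readable_now(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, 0);

	if (ret < 0)
		return -1;

	return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

With the behaviour reported in this letter, readable_now() on an AF_VSOCK socket could return 1 after set_rcvlowat(fd, 128) even though fewer than 128 bytes are queued, and a following blocking read() would then wait.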

Also, in the socket(7) man page I found that:

"Since Linux 2.6.28, select(2), poll(2), and epoll(7) indicate a
socket as readable only if at least SO_RCVLOWAT bytes are available."

I checked the TCP callback for poll() (tcp_poll() in net/ipv4/tcp.c):
it uses the SO_RCVLOWAT value to set the POLLIN bit. I also tested
this case with a TCP socket, and it works as POSIX requires.

I've added fixes to af_vsock.c and virtio_transport_common.c, and
a test is also implemented.

What do you think, guys?

Thank you

Arseniy Krasnov (3):
 vsock_test: POLLIN + SO_RCVLOWAT test.
 virtio/vsock: use 'target' in notify_poll_in callback.
 vsock: use sk_skrcvlowat to set POLLIN,POLLRDNORM bits.

 net/vmw_vsock/af_vsock.c                |  2 +-
 net/vmw_vsock/virtio_transport_common.c |  2 +-
 tools/testing/vsock/vsock_test.c        | 90 +++++++++++++++++++++++++++++++++
 3 files changed, 92 insertions(+), 2 deletions(-)

Stefano Garzarella July 19, 2022, 12:58 p.m. UTC | #1
On Mon, Jul 18, 2022 at 08:12:52AM +0000, Arseniy Krasnov wrote:
>Hello,
>
>during my experiments with zerocopy receive, i found, that in some
>cases, poll() implementation violates POSIX: when socket has non-
>default SO_RCVLOWAT(e.g. not 1), poll() will always set POLLIN and
>POLLRDNORM bits in 'revents' even number of bytes available to read
>on socket is smaller than SO_RCVLOWAT value. In this case,user sees
>POLLIN flag and then tries to read data(for example using  'read()'
>call), but read call will be blocked, because  SO_RCVLOWAT logic is
>supported in dequeue loop in af_vsock.c. But the same time,  POSIX
>requires that:
>
>"POLLIN     Data other than high-priority data may be read without
>            blocking.
> POLLRDNORM Normal data may be read without blocking."
>
>See https://www.open-std.org/jtc1/sc22/open/n4217.pdf, page 293.
>
>So, we have, that poll() syscall returns POLLIN, but read call will
>be blocked.
>
>Also in man page socket(7) i found that:
>
>"Since Linux 2.6.28, select(2), poll(2), and epoll(7) indicate a
>socket as readable only if at least SO_RCVLOWAT bytes are available."
>
>I checked TCP callback for poll()(net/ipv4/tcp.c, tcp_poll()), it
>uses SO_RCVLOWAT value to set POLLIN bit, also i've tested TCP with
>this case for TCP socket, it works as POSIX required.

I tried to look at the code and it seems that only TCP complies with it 
or am I wrong?

>
>I've added some fixes to af_vsock.c and virtio_transport_common.c,
>test is also implemented.
>
>What do You think guys?

Nice, thanks for fixing this and for the test!

I left some comments, but I think the series is fine if we will support 
it in all transports.

I'd just like to understand if it's just TCP complying with it or I'm 
missing some check included in the socket layer that we could reuse.

@David, @Jakub, @Paolo, any advice?

Thanks,
Stefano

Arseniy Krasnov July 20, 2022, 6:07 a.m. UTC | #2
On 19.07.2022 15:58, Stefano Garzarella wrote:
> On Mon, Jul 18, 2022 at 08:12:52AM +0000, Arseniy Krasnov wrote:
>> Hello,
>>
>> during my experiments with zerocopy receive, i found, that in some
>> cases, poll() implementation violates POSIX: when socket has non-
>> default SO_RCVLOWAT(e.g. not 1), poll() will always set POLLIN and
>> POLLRDNORM bits in 'revents' even number of bytes available to read
>> on socket is smaller than SO_RCVLOWAT value. In this case,user sees
>> POLLIN flag and then tries to read data(for example using  'read()'
>> call), but read call will be blocked, because  SO_RCVLOWAT logic is
>> supported in dequeue loop in af_vsock.c. But the same time,  POSIX
>> requires that:
>>
>> "POLLIN     Data other than high-priority data may be read without
>>            blocking.
>> POLLRDNORM Normal data may be read without blocking."
>>
>> See https://www.open-std.org/jtc1/sc22/open/n4217.pdf, page 293.
>>
>> So, we have, that poll() syscall returns POLLIN, but read call will
>> be blocked.
>>
>> Also in man page socket(7) i found that:
>>
>> "Since Linux 2.6.28, select(2), poll(2), and epoll(7) indicate a
>> socket as readable only if at least SO_RCVLOWAT bytes are available."
>>
>> I checked TCP callback for poll()(net/ipv4/tcp.c, tcp_poll()), it
>> uses SO_RCVLOWAT value to set POLLIN bit, also i've tested TCP with
>> this case for TCP socket, it works as POSIX required.
> 
> I tried to look at the code and it seems that only TCP complies with it or am I wrong?
Yes, I checked AF_UNIX; it also doesn't care about that. It calls skb_queue_empty(), which of
course ignores SO_RCVLOWAT.
> 
>>
>> I've added some fixes to af_vsock.c and virtio_transport_common.c,
>> test is also implemented.
>>
>> What do You think guys?
> 
> Nice, thanks for fixing this and for the test!
> 
> I left some comments, but I think the series is fine if we will support it in all transports.
Ack
> 
> I'd just like to understand if it's just TCP complying with it or I'm missing some check included in the socket layer that we could reuse.
It seems that sock_poll(), which is the socket layer entry point for poll(), doesn't contain any such checks.
> 
> @David, @Jakub, @Paolo, any advice?
> 
> Thanks,
> Stefano
> 

PS: moreover, I found one more interesting thing about TCP and poll: the TCP receive logic wakes up the poll waiter
only when the number of available bytes >= SO_RCVLOWAT. I.e. it prevents "spurious" wake-ups, where poll is
woken up because new data arrived, but POLLIN, which would let the user dequeue this data, is not set (as the
amount of data is too small).
See tcp_data_ready() in net/ipv4/tcp_input.c

Thanks

Stefano Garzarella July 20, 2022, 9:30 a.m. UTC | #3
On Wed, Jul 20, 2022 at 06:07:47AM +0000, Arseniy Krasnov wrote:
>On 19.07.2022 15:58, Stefano Garzarella wrote:
>> On Mon, Jul 18, 2022 at 08:12:52AM +0000, Arseniy Krasnov wrote:
>>> Hello,
>>>
>>> during my experiments with zerocopy receive, i found, that in some
>>> cases, poll() implementation violates POSIX: when socket has non-
>>> default SO_RCVLOWAT(e.g. not 1), poll() will always set POLLIN and
>>> POLLRDNORM bits in 'revents' even number of bytes available to read
>>> on socket is smaller than SO_RCVLOWAT value. In this case,user sees
>>> POLLIN flag and then tries to read data(for example using  'read()'
>>> call), but read call will be blocked, because  SO_RCVLOWAT logic is
>>> supported in dequeue loop in af_vsock.c. But the same time,  POSIX
>>> requires that:
>>>
>>> "POLLIN     Data other than high-priority data may be read without
>>>            blocking.
>>> POLLRDNORM Normal data may be read without blocking."
>>>
>>> See https://www.open-std.org/jtc1/sc22/open/n4217.pdf, page 293.
>>>
>>> So, we have, that poll() syscall returns POLLIN, but read call will
>>> be blocked.
>>>
>>> Also in man page socket(7) i found that:
>>>
>>> "Since Linux 2.6.28, select(2), poll(2), and epoll(7) indicate a
>>> socket as readable only if at least SO_RCVLOWAT bytes are available."
>>>
>>> I checked TCP callback for poll()(net/ipv4/tcp.c, tcp_poll()), it
>>> uses SO_RCVLOWAT value to set POLLIN bit, also i've tested TCP with
>>> this case for TCP socket, it works as POSIX required.
>>
>> I tried to look at the code and it seems that only TCP complies with it or am I wrong?
>Yes, i checked AF_UNIX, it also don't care about that. It calls skb_queue_empty() that of
>course ignores SO_RCVLOWAT.
>>
>>>
>>> I've added some fixes to af_vsock.c and virtio_transport_common.c,
>>> test is also implemented.
>>>
>>> What do You think guys?
>>
>> Nice, thanks for fixing this and for the test!
>>
>> I left some comments, but I think the series is fine if we will support it in all transports.
>Ack
>>
>> I'd just like to understand if it's just TCP complying with it or I'm missing some check included in the socket layer that we could reuse.
>Seems sock_poll() which is socket layer entry point for poll() doesn't contain any such checks
>>
>> @David, @Jakub, @Paolo, any advice?
>>
>> Thanks,
>> Stefano
>>
>
>PS: moreover, i found one more interesting thing with TCP and poll: TCP receive logic wakes up poll waiter
>only when number of available bytes > SO_RCVLOWAT. E.g. it prevents "spurious" wake ups, when poll will be
>woken up because new data arrived, but POLLIN to allow user dequeue this data won't be set(as amount of data
>is too small).
>See tcp_data_ready() in net/ipv4/tcp_input.c

Do you mean that we should call sk->sk_data_ready(sk) only after
checking SO_RCVLOWAT?

It seems fine, maybe we can add vsock_data_ready() in af_vsock.c that 
transports should call instead of calling sk->sk_data_ready(sk) 
directly.

Then we can do something similar to tcp_data_ready().
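
A minimal userspace sketch of that idea, assuming the helper simply gates the callback on the low-water mark. Everything named mock_* here is invented for illustration and does not correspond to real kernel structures; a real vsock_data_ready() would have to ask the transport how much data is queued. The 'done' flag loosely mirrors the way tcp_data_ready() still wakes the waiter once the socket is marked done.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock socket: only the fields this sketch needs (all hypothetical). */
struct mock_sock {
	size_t avail;          /* bytes currently queued for reading */
	int rcvlowat;          /* SO_RCVLOWAT value, assumed >= 1 */
	bool done;             /* peer is gone: always wake the waiter */
	unsigned int wakeups;  /* how many times the waiter was woken */
};

/* Stand-in for sk->sk_data_ready(sk): just counts the wake-up. */
static void mock_wake_waiter(struct mock_sock *sk)
{
	sk->wakeups++;
}

/*
 * Transports would call this instead of waking the waiter directly:
 * wake only if enough data is queued, or if no more data will come.
 */
static void mock_data_ready(struct mock_sock *sk)
{
	assert(sk->rcvlowat >= 1);

	if (sk->avail >= (size_t)sk->rcvlowat || sk->done)
		mock_wake_waiter(sk);
}

/* Simulate 'len' new bytes arriving from the transport. */
static void mock_enqueue(struct mock_sock *sk, size_t len)
{
	sk->avail += len;
	mock_data_ready(sk);
}
```

With rcvlowat set to 128, two arrivals of 100 and 28 bytes produce a single wake-up, on the second arrival, instead of two.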

Thanks,
Stefano

Arseniy Krasnov July 20, 2022, 10:52 a.m. UTC | #4
On 20.07.2022 12:30, Stefano Garzarella wrote:
> On Wed, Jul 20, 2022 at 06:07:47AM +0000, Arseniy Krasnov wrote:
>> On 19.07.2022 15:58, Stefano Garzarella wrote:
>>> On Mon, Jul 18, 2022 at 08:12:52AM +0000, Arseniy Krasnov wrote:
>>>> Hello,
>>>>
>>>> during my experiments with zerocopy receive, i found, that in some
>>>> cases, poll() implementation violates POSIX: when socket has non-
>>>> default SO_RCVLOWAT(e.g. not 1), poll() will always set POLLIN and
>>>> POLLRDNORM bits in 'revents' even number of bytes available to read
>>>> on socket is smaller than SO_RCVLOWAT value. In this case,user sees
>>>> POLLIN flag and then tries to read data(for example using  'read()'
>>>> call), but read call will be blocked, because  SO_RCVLOWAT logic is
>>>> supported in dequeue loop in af_vsock.c. But the same time,  POSIX
>>>> requires that:
>>>>
>>>> "POLLIN     Data other than high-priority data may be read without
>>>>            blocking.
>>>> POLLRDNORM Normal data may be read without blocking."
>>>>
>>>> See https://www.open-std.org/jtc1/sc22/open/n4217.pdf, page 293.
>>>>
>>>> So, we have, that poll() syscall returns POLLIN, but read call will
>>>> be blocked.
>>>>
>>>> Also in man page socket(7) i found that:
>>>>
>>>> "Since Linux 2.6.28, select(2), poll(2), and epoll(7) indicate a
>>>> socket as readable only if at least SO_RCVLOWAT bytes are available."
>>>>
>>>> I checked TCP callback for poll()(net/ipv4/tcp.c, tcp_poll()), it
>>>> uses SO_RCVLOWAT value to set POLLIN bit, also i've tested TCP with
>>>> this case for TCP socket, it works as POSIX required.
>>>
>>> I tried to look at the code and it seems that only TCP complies with it or am I wrong?
>> Yes, i checked AF_UNIX, it also don't care about that. It calls skb_queue_empty() that of
>> course ignores SO_RCVLOWAT.
>>>
>>>>
>>>> I've added some fixes to af_vsock.c and virtio_transport_common.c,
>>>> test is also implemented.
>>>>
>>>> What do You think guys?
>>>
>>> Nice, thanks for fixing this and for the test!
>>>
>>> I left some comments, but I think the series is fine if we will support it in all transports.
>> Ack
>>>
>>> I'd just like to understand if it's just TCP complying with it or I'm missing some check included in the socket layer that we could reuse.
>> Seems sock_poll() which is socket layer entry point for poll() doesn't contain any such checks
>>>
>>> @David, @Jakub, @Paolo, any advice?
>>>
>>> Thanks,
>>> Stefano
>>>
>>
>> PS: moreover, i found one more interesting thing with TCP and poll: TCP receive logic wakes up poll waiter
>> only when number of available bytes > SO_RCVLOWAT. E.g. it prevents "spurious" wake ups, when poll will be
>> woken up because new data arrived, but POLLIN to allow user dequeue this data won't be set(as amount of data
>> is too small).
>> See tcp_data_ready() in net/ipv4/tcp_input.c
> 
> Do you mean that we should call sk->sk_data_ready(sk) checking SO_RCVLOWAT?
Yes, like tcp_data_ready().
> 
> It seems fine, maybe we can add vsock_data_ready() in af_vsock.c that transports should call instead of calling sk->sk_data_ready(sk) directly.
Yes, this will also update logic in vmci and hyperv transports
> 
> Then we can something similar to tcp_data_ready().
> 
> Thanks,
> Stefano
>