mbox series

[0/3] Advertise OS version

Message ID 20240619125708.3719150-1-christian.couder@gmail.com (mailing list archive)
Headers show
Series Advertise OS version | expand

Message

Christian Couder June 19, 2024, 12:57 p.m. UTC
For debugging and statistical purposes, it can be useful for Git
servers to know the OS the client are using.

So let's add a new 'os-version' capability to the v2 protocol, in the
same way as the existing 'agent' capability that lets clients and
servers exchange the Git version they are running.

This sends the same info as `git bugreport` is already sending, which
uses uname(2). It should be the same as what `uname -srvm` returns,
except that it is sanitized in the same way as the Git version sent by
the 'agent' capability is sanitized (by replacing character having an
ascii code less than 32 or more than 127 with '.').

CI tests are currently failing on Windows as it looks like uname(1)
and uname(2) don't report the same thing:

  -os-version=MINGW64_NT-10.0-20348.3.4.10-87d57229.x86_64.2024-02-14.20:17.UTC.x86_64
  +os-version=Windows.10.0.20348

(See: https://github.com/chriscool/git/actions/runs/9581822699)

Thoughts?

Christian Couder (3):
  version: refactor strbuf_sanitize()
  version: refactor get_uname_info()
  connect: advertise OS version

 Documentation/gitprotocol-v2.txt | 18 +++++++++
 builtin/bugreport.c              | 13 +------
 connect.c                        |  3 ++
 serve.c                          | 12 ++++++
 t/t5555-http-smart-common.sh     |  3 ++
 t/t5701-git-serve.sh             |  3 ++
 version.c                        | 67 ++++++++++++++++++++++++++++----
 version.h                        | 10 +++++
 8 files changed, 111 insertions(+), 18 deletions(-)

Comments

Dragan Simic June 19, 2024, 1:18 p.m. UTC | #1
Hello Christian,

On 2024-06-19 14:57, Christian Couder wrote:
> For debugging and statistical purposes, it can be useful for Git
> servers to know the OS the client are using.
> 
> So let's add a new 'os-version' capability to the v2 protocol, in the
> same way as the existing 'agent' capability that lets clients and
> servers exchange the Git version they are running.
> 
> This sends the same info as `git bugreport` is already sending, which
> uses uname(2). It should be the same as what `uname -srvm` returns,
> except that it is sanitized in the same way as the Git version sent by
> the 'agent' capability is sanitized (by replacing character having an
> ascii code less than 32 or more than 127 with '.').

This may probably be a useful debugging feature, but I strongly
suggest that a configuration knob exists that makes disabling it
possible.  For security reasons, some users may not want to
publicly advertise their OSes and kernel versions.  Count me in
as one of such users. :)

> CI tests are currently failing on Windows as it looks like uname(1)
> and uname(2) don't report the same thing:
> 
> 
> -os-version=MINGW64_NT-10.0-20348.3.4.10-87d57229.x86_64.2024-02-14.20:17.UTC.x86_64
>   +os-version=Windows.10.0.20348
> 
> (See: https://github.com/chriscool/git/actions/runs/9581822699)
> 
> Thoughts?
> 
> Christian Couder (3):
>   version: refactor strbuf_sanitize()
>   version: refactor get_uname_info()
>   connect: advertise OS version
> 
>  Documentation/gitprotocol-v2.txt | 18 +++++++++
>  builtin/bugreport.c              | 13 +------
>  connect.c                        |  3 ++
>  serve.c                          | 12 ++++++
>  t/t5555-http-smart-common.sh     |  3 ++
>  t/t5701-git-serve.sh             |  3 ++
>  version.c                        | 67 ++++++++++++++++++++++++++++----
>  version.h                        | 10 +++++
>  8 files changed, 111 insertions(+), 18 deletions(-)
Jeff King June 19, 2024, 1:45 p.m. UTC | #2
On Wed, Jun 19, 2024 at 03:18:40PM +0200, Dragan Simic wrote:

> Hello Christian,
> 
> On 2024-06-19 14:57, Christian Couder wrote:
> > For debugging and statistical purposes, it can be useful for Git
> > servers to know the OS the client are using.
> > 
> > So let's add a new 'os-version' capability to the v2 protocol, in the
> > same way as the existing 'agent' capability that lets clients and
> > servers exchange the Git version they are running.
> > 
> > This sends the same info as `git bugreport` is already sending, which
> > uses uname(2). It should be the same as what `uname -srvm` returns,
> > except that it is sanitized in the same way as the Git version sent by
> > the 'agent' capability is sanitized (by replacing character having an
> > ascii code less than 32 or more than 127 with '.').
> 
> This may probably be a useful debugging feature, but I strongly
> suggest that a configuration knob exists that makes disabling it
> possible.  For security reasons, some users may not want to
> publicly advertise their OSes and kernel versions.  Count me in
> as one of such users. :)

Agreed. We do send the Git version, which is already a slight privacy
issue (though it can be overridden at both build-time and run-time). But
OS details seems like crossing a line to me.

I don't mind if this is present but disabled by default, but then I
guess it is not really serving much of a purpose, as hardly anybody
would enable it. Which makes collecting large-scale statistics by
hosting providers pretty much useless (and I don't think it is all that
useful for debugging individual cases).

-Peff
Dragan Simic June 19, 2024, 1:50 p.m. UTC | #3
Hello Jeff,

On 2024-06-19 15:45, Jeff King wrote:
> On Wed, Jun 19, 2024 at 03:18:40PM +0200, Dragan Simic wrote:
>> On 2024-06-19 14:57, Christian Couder wrote:
>> > For debugging and statistical purposes, it can be useful for Git
>> > servers to know the OS the client are using.
>> >
>> > So let's add a new 'os-version' capability to the v2 protocol, in the
>> > same way as the existing 'agent' capability that lets clients and
>> > servers exchange the Git version they are running.
>> >
>> > This sends the same info as `git bugreport` is already sending, which
>> > uses uname(2). It should be the same as what `uname -srvm` returns,
>> > except that it is sanitized in the same way as the Git version sent by
>> > the 'agent' capability is sanitized (by replacing character having an
>> > ascii code less than 32 or more than 127 with '.').
>> 
>> This may probably be a useful debugging feature, but I strongly
>> suggest that a configuration knob exists that makes disabling it
>> possible.  For security reasons, some users may not want to
>> publicly advertise their OSes and kernel versions.  Count me in
>> as one of such users. :)
> 
> Agreed. We do send the Git version, which is already a slight privacy
> issue (though it can be overridden at both build-time and run-time). 
> But
> OS details seems like crossing a line to me.
> 
> I don't mind if this is present but disabled by default, but then I
> guess it is not really serving much of a purpose, as hardly anybody
> would enable it. Which makes collecting large-scale statistics by
> hosting providers pretty much useless (and I don't think it is all that
> useful for debugging individual cases).

I agree that it should actually be disabled by default, for privacy
and security reasons, but that would actually defeat its purpose, so
I'm not really sure should it be merged.
Christian Couder June 19, 2024, 2:01 p.m. UTC | #4
On Wed, Jun 19, 2024 at 3:50 PM Dragan Simic <dsimic@manjaro.org> wrote:

> > I don't mind if this is present but disabled by default, but then I
> > guess it is not really serving much of a purpose, as hardly anybody
> > would enable it. Which makes collecting large-scale statistics by
> > hosting providers pretty much useless (and I don't think it is all that
> > useful for debugging individual cases).
>
> I agree that it should actually be disabled by default, for privacy
> and security reasons, but that would actually defeat its purpose, so
> I'm not really sure should it be merged.

One possibility is to send just the `sysname`, described as 'Operating
system name (e.g., "Linux")', field of the struct utsname filled out
by uname(2) by default.

It should be the same as what `uname -s` prints, so "Linux" for a
Linux machine, and might be acceptable regarding privacy concerns.

And then there might be a knob to deactivate it completely or to make
it more verbose (which might be useful for example in a corporate
context).
Dragan Simic June 19, 2024, 2:19 p.m. UTC | #5
On 2024-06-19 16:01, Christian Couder wrote:
> On Wed, Jun 19, 2024 at 3:50 PM Dragan Simic <dsimic@manjaro.org> 
> wrote:
> 
>> > I don't mind if this is present but disabled by default, but then I
>> > guess it is not really serving much of a purpose, as hardly anybody
>> > would enable it. Which makes collecting large-scale statistics by
>> > hosting providers pretty much useless (and I don't think it is all that
>> > useful for debugging individual cases).
>> 
>> I agree that it should actually be disabled by default, for privacy
>> and security reasons, but that would actually defeat its purpose, so
>> I'm not really sure should it be merged.
> 
> One possibility is to send just the `sysname`, described as 'Operating
> system name (e.g., "Linux")', field of the struct utsname filled out
> by uname(2) by default.
> 
> It should be the same as what `uname -s` prints, so "Linux" for a
> Linux machine, and might be acceptable regarding privacy concerns.
> 
> And then there might be a knob to deactivate it completely or to make
> it more verbose (which might be useful for example in a corporate
> context).

I'd be fine with advertising "Linux" (or "Windows") only by default,
because it doesn't reveal much from the privacy and security standpoint,
but allows rather usable statistics to be collected.

A configuration knob that would allow it to be disabled entirely, or
be enabled with more details to be sent would also be fine with me.
Randall S. Becker June 19, 2024, 2:41 p.m. UTC | #6
On Wednesday, June 19, 2024 10:20 AM, Dragan Simic wrote:
>On 2024-06-19 16:01, Christian Couder wrote:
>> On Wed, Jun 19, 2024 at 3:50 PM Dragan Simic <dsimic@manjaro.org>
>> wrote:
>>
>>> > I don't mind if this is present but disabled by default, but then I
>>> > guess it is not really serving much of a purpose, as hardly anybody
>>> > would enable it. Which makes collecting large-scale statistics by
>>> > hosting providers pretty much useless (and I don't think it is all
>>> > that useful for debugging individual cases).
>>>
>>> I agree that it should actually be disabled by default, for privacy
>>> and security reasons, but that would actually defeat its purpose, so
>>> I'm not really sure should it be merged.
>>
>> One possibility is to send just the `sysname`, described as 'Operating
>> system name (e.g., "Linux")', field of the struct utsname filled out
>> by uname(2) by default.
>>
>> It should be the same as what `uname -s` prints, so "Linux" for a
>> Linux machine, and might be acceptable regarding privacy concerns.
>>
>> And then there might be a knob to deactivate it completely or to make
>> it more verbose (which might be useful for example in a corporate
>> context).
>
>I'd be fine with advertising "Linux" (or "Windows") only by default, because it
>doesn't reveal much from the privacy and security standpoint, but allows rather
>usable statistics to be collected.
>
>A configuration knob that would allow it to be disabled entirely, or be enabled with
>more details to be sent would also be fine with me.

While in the code, can I suggest including the OpenSSL version used in the build? This came up in at a customer a few weeks ago and they could not answer the question of what git build they were using. Turned out it used the wrong OpenSSL header compared to what they had installed.
Jeff King June 19, 2024, 2:50 p.m. UTC | #7
On Wed, Jun 19, 2024 at 04:01:57PM +0200, Christian Couder wrote:

> On Wed, Jun 19, 2024 at 3:50 PM Dragan Simic <dsimic@manjaro.org> wrote:
> 
> > > I don't mind if this is present but disabled by default, but then I
> > > guess it is not really serving much of a purpose, as hardly anybody
> > > would enable it. Which makes collecting large-scale statistics by
> > > hosting providers pretty much useless (and I don't think it is all that
> > > useful for debugging individual cases).
> >
> > I agree that it should actually be disabled by default, for privacy
> > and security reasons, but that would actually defeat its purpose, so
> > I'm not really sure should it be merged.
> 
> One possibility is to send just the `sysname`, described as 'Operating
> system name (e.g., "Linux")', field of the struct utsname filled out
> by uname(2) by default.

That would be better to me. I still don't love it, but I admit it's
coming more from a knee-jerk response than from some rational argument
against people knowing I run Linux.

Since HTTP user-agent fields are common, we can look at those for prior
art. curl sends its own version but nothing else. Most browsers do seem
to include some OS information. My version of firefox gives its own
version along with "Linux x86_64". So basically "uname -sm".

> And then there might be a knob to deactivate it completely or to make
> it more verbose (which might be useful for example in a corporate
> context).

Yes, I think we should definitely have an option to suppress or override
it, just like we do for the user-agent string.

-Peff
Jeff King June 19, 2024, 2:52 p.m. UTC | #8
On Wed, Jun 19, 2024 at 10:41:54AM -0400, rsbecker@nexbridge.com wrote:

> >A configuration knob that would allow it to be disabled entirely, or
> >be enabled with more details to be sent would also be fine with me.
> 
> While in the code, can I suggest including the OpenSSL version used in
> the build? This came up in at a customer a few weeks ago and they
> could not answer the question of what git build they were using.
> Turned out it used the wrong OpenSSL header compared to what they had
> installed.

At the point you are dealing one-on-one with somebody, I don't think
protocol-level messages like this are the best spot for debugging. But
it might make sense to teach "git version --build-options" to report the
openssl version.

-Peff
Randall S. Becker June 19, 2024, 2:57 p.m. UTC | #9
On Wednesday, June 19, 2024 10:53 AM, Peff wrote:
>On Wed, Jun 19, 2024 at 10:41:54AM -0400, rsbecker@nexbridge.com wrote:
>
>> >A configuration knob that would allow it to be disabled entirely, or
>> >be enabled with more details to be sent would also be fine with me.
>>
>> While in the code, can I suggest including the OpenSSL version used in
>> the build? This came up in at a customer a few weeks ago and they
>> could not answer the question of what git build they were using.
>> Turned out it used the wrong OpenSSL header compared to what they had
>> installed.
>
>At the point you are dealing one-on-one with somebody, I don't think protocol-level
>messages like this are the best spot for debugging. But it might make sense to teach
>"git version --build-options" to report the openssl version.

Much better idea. Will look into it. Thanks.
Randall S. Becker June 19, 2024, 3:25 p.m. UTC | #10
On Wednesday, June 19, 2024 10:51 AM, Peff wrote:
>On Wed, Jun 19, 2024 at 04:01:57PM +0200, Christian Couder wrote:
>
>> On Wed, Jun 19, 2024 at 3:50 PM Dragan Simic <dsimic@manjaro.org> wrote:
>>
>> > > I don't mind if this is present but disabled by default, but then
>> > > I guess it is not really serving much of a purpose, as hardly
>> > > anybody would enable it. Which makes collecting large-scale
>> > > statistics by hosting providers pretty much useless (and I don't
>> > > think it is all that useful for debugging individual cases).
>> >
>> > I agree that it should actually be disabled by default, for privacy
>> > and security reasons, but that would actually defeat its purpose, so
>> > I'm not really sure should it be merged.
>>
>> One possibility is to send just the `sysname`, described as 'Operating
>> system name (e.g., "Linux")', field of the struct utsname filled out
>> by uname(2) by default.
>
>That would be better to me. I still don't love it, but I admit it's coming more from a
>knee-jerk response than from some rational argument against people knowing I run
>Linux.
>
>Since HTTP user-agent fields are common, we can look at those for prior art. curl
>sends its own version but nothing else. Most browsers do seem to include some OS
>information. My version of firefox gives its own version along with "Linux x86_64".
>So basically "uname -sm".
>
>> And then there might be a knob to deactivate it completely or to make
>> it more verbose (which might be useful for example in a corporate
>> context).
>
>Yes, I think we should definitely have an option to suppress or override it, just like
>we do for the user-agent string.

Instead of an override, what about a knob that specifies the uname command to use to build the value. Personally, I would use `uname -s -r -v` on NonStop to get the kernel version used in the build. The difficulty on my platform is that this is not truly useful info. The effective build OS compatibility version is in a #define __L_Series_RVU and __H_Series_RVU, so the knob might be needed in git_compat_util.h or similar. This comes from the compiler arguments, which are not yet captured.
Dragan Simic June 19, 2024, 3:53 p.m. UTC | #11
On 2024-06-19 16:41, rsbecker@nexbridge.com wrote:
> On Wednesday, June 19, 2024 10:20 AM, Dragan Simic wrote:
>> On 2024-06-19 16:01, Christian Couder wrote:
>>> On Wed, Jun 19, 2024 at 3:50 PM Dragan Simic <dsimic@manjaro.org>
>>> wrote:
>>> 
>>>> > I don't mind if this is present but disabled by default, but then I
>>>> > guess it is not really serving much of a purpose, as hardly anybody
>>>> > would enable it. Which makes collecting large-scale statistics by
>>>> > hosting providers pretty much useless (and I don't think it is all
>>>> > that useful for debugging individual cases).
>>>> 
>>>> I agree that it should actually be disabled by default, for privacy
>>>> and security reasons, but that would actually defeat its purpose, so
>>>> I'm not really sure should it be merged.
>>> 
>>> One possibility is to send just the `sysname`, described as 
>>> 'Operating
>>> system name (e.g., "Linux")', field of the struct utsname filled out
>>> by uname(2) by default.
>>> 
>>> It should be the same as what `uname -s` prints, so "Linux" for a
>>> Linux machine, and might be acceptable regarding privacy concerns.
>>> 
>>> And then there might be a knob to deactivate it completely or to make
>>> it more verbose (which might be useful for example in a corporate
>>> context).
>> 
>> I'd be fine with advertising "Linux" (or "Windows") only by default, 
>> because it
>> doesn't reveal much from the privacy and security standpoint, but 
>> allows rather
>> usable statistics to be collected.
>> 
>> A configuration knob that would allow it to be disabled entirely, or 
>> be enabled with
>> more details to be sent would also be fine with me.
> 
> While in the code, can I suggest including the OpenSSL version used in
> the build? This came up in at a customer a few weeks ago and they
> could not answer the question of what git build they were using.
> Turned out it used the wrong OpenSSL header compared to what they had
> installed.

Makes sense to me, but only in the non-default "advertise more details"
mode of the new configuration knob.
brian m. carlson June 19, 2024, 10:46 p.m. UTC | #12
On 2024-06-19 at 14:50:42, Jeff King wrote:
> On Wed, Jun 19, 2024 at 04:01:57PM +0200, Christian Couder wrote:
> 
> > One possibility is to send just the `sysname`, described as 'Operating
> > system name (e.g., "Linux")', field of the struct utsname filled out
> > by uname(2) by default.
> 
> That would be better to me. I still don't love it, but I admit it's
> coming more from a knee-jerk response than from some rational argument
> against people knowing I run Linux.
> 
> Since HTTP user-agent fields are common, we can look at those for prior
> art. curl sends its own version but nothing else. Most browsers do seem
> to include some OS information. My version of firefox gives its own
> version along with "Linux x86_64". So basically "uname -sm".

If we choose to enable this, this is the right level of detail, yeah.
We could also allow a distributor to set this value at compile time,
much like Debian does for Postfix and OpenSSH.  For Postfix, it's simply
"(Debian)", which doesn't give much information.

To me as a server administrator interested in statistics, it's useful to
me to know OS name and version (as in, how many users are still using an
ancient version of CentOS?), since that tells me about things like
supported TLS versions which is helpful, but as a user I don't think
that's an appropriate level of detail to share.  And I also worry about
fingerprinting and tracking, which is a giant problem with HTTP
user-agents.  This is especially true if you're using something like
FreeBSD RISC-V, which is just not that common.

> > And then there might be a knob to deactivate it completely or to make
> > it more verbose (which might be useful for example in a corporate
> > context).
> 
> Yes, I think we should definitely have an option to suppress or override
> it, just like we do for the user-agent string.

I definitely think we should have both.  I'm sure we'll have some server
maintainer or repository administrator who tries to reject "bad" OSes
(like someone who doesn't like their employees using WSL, for example).
We've already had people propose to reject access based on the version
number in the name of "security", despite the fact that most Linux
distros just backport security patches and thus the version number is
not usually interesting in that regard.  Again, HTTP user-agents tell us
that people will make access control decisions here even though they
should not.

We'll want to honour people's decisions to remain a mystery or to work
around broken server implementations, or just to make it harder to track
or fingerprint them.

I also think the documentation should state that for the user-agent and
os-version fields that they are merely informative, can be changed, and
MUST NOT be used for access control.  That doesn't mean people will
honour it, but it does mean that we can and should feel free to break
implementations that don't comply.
Jeff King June 20, 2024, 3:25 p.m. UTC | #13
On Wed, Jun 19, 2024 at 10:46:13PM +0000, brian m. carlson wrote:

> We'll want to honour people's decisions to remain a mystery or to work
> around broken server implementations, or just to make it harder to track
> or fingerprint them.

Yep, I agree with everything you said, especially this part.

> I also think the documentation should state that for the user-agent and
> os-version fields that they are merely informative, can be changed, and
> MUST NOT be used for access control.  That doesn't mean people will
> honour it, but it does mean that we can and should feel free to break
> implementations that don't comply.

I think Christian's proposed documentation did have something along
these lines.

I do kind of wonder if we even need a separate "os-version" field, and
if it couldn't simply be plugged into the user-agent string (making it
"git/1.2.3 Linux x86_64" or something). But maybe that introduces more
hassles with respect to configuring/overriding the two parts separately.

-Peff
Junio C Hamano June 20, 2024, 4:28 p.m. UTC | #14
Christian Couder <christian.couder@gmail.com> writes:

> For debugging and statistical purposes, it can be useful for Git
> servers to know the OS the client are using.
>
> So let's add a new 'os-version' capability to the v2 protocol, in the
> same way as the existing 'agent' capability that lets clients and
> servers exchange the Git version they are running.
>
> This sends the same info as `git bugreport` is already sending, which
> uses uname(2). It should be the same as what `uname -srvm` returns,
> except that it is sanitized in the same way as the Git version sent by
> the 'agent' capability is sanitized (by replacing character having an
> ascii code less than 32 or more than 127 with '.').
>
> CI tests are currently failing on Windows as it looks like uname(1)
> and uname(2) don't report the same thing:
>
>   -os-version=MINGW64_NT-10.0-20348.3.4.10-87d57229.x86_64.2024-02-14.20:17.UTC.x86_64
>   +os-version=Windows.10.0.20348
>
> (See: https://github.com/chriscool/git/actions/runs/9581822699)

I think we already heard from enough people to cover the spectrum of
opinions.  I'd have to say that needs to be carefully kept to the
minimum what we send in the 'user-agent' like manner.  The "git
bugreport" is an opt-in "these should help you in helping me"
feature, designed to allow further redacting by the user before
sending it out, and should not be compared with "on by default for
everybody" telemetry data.

I personally like the idea to add to user-agent, instead of adding a
new capability.  What is the true motivation behind this?  Is this
thing meant to gather statistics from potentially non-paying general
public from hosting providers, or is this primarily for $CORP IT
folks to make sure that nobody is being too stale?