diff mbox

Regression: drm: Lobotomize set_busid nonsense for !pci drivers (a325725633c2)

Message ID 9d5cd002-eb29-bc4d-89e8-a4a594dd84ad@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Hans de Goede Sept. 30, 2016, 10:35 a.m. UTC
Hi,

On 30-09-16 10:28, Hans de Goede wrote:
> Hi,
>

<snip>

> Xorg when running without a Xorg.conf searches for what it considers
> a "primary" gpu / video-card, basically it attempts to bring up the
> right card in setups where there are multiple cards and if it does not
> find one exits with an error.
>
> The xserver has a 2 step process for finding the primary card:
>
> 1) It searches for is a card which has a vga-bios mapped,
> as we've already determined in the mentioned Red Hat bug that works for
> the classic qemu emulated video-cards, but not for qemu's virtio-vga.
>
> 2) If that does not work Xorg will fallback to any video class device
> on pci-bus 1.
>
> This fallback actually has been broken in the Xorg xserver for quite a
> while now and only 2 days ago a patch from Laszlo was merged to fix this.
>
> Only for things to break again due to this kernel patch.
>
> Since the whole step 2) thingie is very much tied to x86 machines
> where pci-bus 0 used to be the main bus and pci-bus 1 the agp,
> which is sorta an obsolete assumption now a days and  since relying
> on bus numbers / enumeration order is a bad idea in general I'm not
> entirely sure if this counts as a regression.
>
> I've discussed the problem of the xserver exiting with an error when
> no primary device can be found with some people (ajax) at XDC last week
> since there are other use-cases where the pci-bus 1 fallback does not
> work.
>
> As such I've been working on a xserver patch-set to make the xserver
> try harder (pick the first available device) when both steps described
> above fail to find one, which should make things work even with the
> newest (broken / regressed) kernels.
>
> Given this mail thread, I guess I'm working after all today (I had
> planned a day off) and I'll try to wrap up this patch-set and reply
> to this mail with the server patches attached for Joachim and/or
> Laszlo to test.

Attached are 2 patches against the xserver which should fix this,
please give them a try.

Regards,

Hans

Comments

Laszlo Ersek Sept. 30, 2016, 2:51 p.m. UTC | #1
On 09/30/16 12:35, Hans de Goede wrote:

> Attached are 2 patches against the xserver which should fix this,
> please give them a try.

Sorry about the delay.

The patches don't seem to fix the issue for me. Please see the Xorg log
attached.

I tested the patches as follows. Given that my bisection had been done
in a Fedora 24 guest, using

  xorg-x11-server-1.18.4-4.fc24
  http://koji.fedoraproject.org/koji/buildinfo?buildID=794494

I now rebuilt the guest kernel exactly at the failing commit (a325725
"drm: Lobotomize set_busid nonsense for !pci drivers"), and first
reproduced the issue with the above X server.

Then, I ported your patches to "xorg-server-1.18.4" (using the upstream
xserver tree), and rebuilt the Fedora package with the backport. For the
backport, I had to cherry-pick the following two patches from master first:

1 ca8d88e50310 xfree86: recognize primary BUS_PCI device in
               xf86IsPrimaryPlatform()
2 ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID

This way your patches applied cleanly. (Cherry pick #1 above is actually
necessary for semantics, while cherry pick #2 is needed for a clean
context only, and has no impact for this test.)

That is, in total, I added the following four patches to the Fedora 24
package:

1 xfree86: recognize primary BUS_PCI device in xf86IsPrimaryPlatform()
2 config: fix GPUDevice fail when AutoAddGPU off + BusID
3 xfree86: Make adding unclaimed devices as GPU devices a separate step
4 xfree86: Try harder to find atleast 1 non GPU Screen

You can find the scratch build that I used for testing here:

  xorg-x11-server-1.18.4-4.hans_bz1366842_2.fc24
  http://koji.fedoraproject.org/koji/taskinfo?taskID=15875087

Another reason I used F24's X server as basis, rather than upstream
HEAD, is that Fedora 24 is pretty young, and it's already on kernel
4.7.4, and I believe it will soon move to kernel 4.8, without
(necessarily) rebasing its X server package to upstream. IOW the kernel
upgrade to 4.8 will break X in Fedora 24 too, and then I expect the
Fedora X maintainers would have to cherry pick those two patches as
dependencies just the same.

To summarize, the patches don't seem to help. I shall nonetheless thank
you for spending your Friday on this!

Laszlo
Hans de Goede Sept. 30, 2016, 2:59 p.m. UTC | #2
Hi,

On 30-09-16 16:51, Laszlo Ersek wrote:
> On 09/30/16 12:35, Hans de Goede wrote:
>
>> Attached are 2 patches against the xserver which should fix this,
>> please give them a try.
>
> Sorry about the delay.
>
> The patches don't seem to fix the issue for me. Please see the Xorg log
> attached.
>
> I tested the patches as follows. Given that my bisection had been done
> in a Fedora 24 guest, using
>
>   xorg-x11-server-1.18.4-4.fc24
>   http://koji.fedoraproject.org/koji/buildinfo?buildID=794494
>
> I now rebuilt the guest kernel exactly at the failing commit (a325725
> "drm: Lobotomize set_busid nonsense for !pci drivers"), and first
> reproduced the issue with the above X server.
>
> Then, I ported your patches to "xorg-server-1.18.4" (using the upstream
> xserver tree), and rebuilt the Fedora package with the backport. For the
> backport, I had to cherry-pick the following two patches from master first:
>
> 1 ca8d88e50310 xfree86: recognize primary BUS_PCI device in
>                xf86IsPrimaryPlatform()
> 2 ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID
>
> This way your patches applied cleanly. (Cherry pick #1 above is actually
> necessary for semantics, while cherry pick #2 is needed for a clean
> context only, and has no impact for this test.)
>
> That is, in total, I added the following four patches to the Fedora 24
> package:
>
> 1 xfree86: recognize primary BUS_PCI device in xf86IsPrimaryPlatform()
> 2 config: fix GPUDevice fail when AutoAddGPU off + BusID
> 3 xfree86: Make adding unclaimed devices as GPU devices a separate step
> 4 xfree86: Try harder to find atleast 1 non GPU Screen
>
> You can find the scratch build that I used for testing here:
>
>   xorg-x11-server-1.18.4-4.hans_bz1366842_2.fc24
>   http://koji.fedoraproject.org/koji/taskinfo?taskID=15875087
>
> Another reason I used F24's X server as basis, rather than upstream
> HEAD, is that Fedora 24 is pretty young, and it's already on kernel
> 4.7.4, and I believe it will soon move to kernel 4.8, without
> (necessarily) rebasing its X server package to upstream. IOW the kernel
> upgrade to 4.8 will break X in Fedora 24 too, and then I expect the
> Fedora X maintainers would have to cherry pick those two patches as
> dependencies just the same.
>
> To summarize, the patches don't seem to help. I shall nonetheless thank
> you for spending your Friday on this!

Hmm, do you have a xorg.conf file lying around somewhere, the message
about the xserver not being able to find an entry for screen 0 does
not make sense ...

Anyways, Monday is another day we can look at this ...

Regards,

Hans
Laszlo Ersek Sept. 30, 2016, 3:33 p.m. UTC | #3
On 09/30/16 16:59, Hans de Goede wrote:
> Hi,
> 
> On 30-09-16 16:51, Laszlo Ersek wrote:
>> On 09/30/16 12:35, Hans de Goede wrote:
>>
>>> Attached are 2 patches against the xserver which should fix this,
>>> please give them a try.
>>
>> Sorry about the delay.
>>
>> The patches don't seem to fix the issue for me. Please see the Xorg log
>> attached.
>>
>> I tested the patches as follows. Given that my bisection had been done
>> in a Fedora 24 guest, using
>>
>>   xorg-x11-server-1.18.4-4.fc24
>>   http://koji.fedoraproject.org/koji/buildinfo?buildID=794494
>>
>> I now rebuilt the guest kernel exactly at the failing commit (a325725
>> "drm: Lobotomize set_busid nonsense for !pci drivers"), and first
>> reproduced the issue with the above X server.
>>
>> Then, I ported your patches to "xorg-server-1.18.4" (using the upstream
>> xserver tree), and rebuilt the Fedora package with the backport. For the
>> backport, I had to cherry-pick the following two patches from master
>> first:
>>
>> 1 ca8d88e50310 xfree86: recognize primary BUS_PCI device in
>>                xf86IsPrimaryPlatform()
>> 2 ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>
>> This way your patches applied cleanly. (Cherry pick #1 above is actually
>> necessary for semantics, while cherry pick #2 is needed for a clean
>> context only, and has no impact for this test.)
>>
>> That is, in total, I added the following four patches to the Fedora 24
>> package:
>>
>> 1 xfree86: recognize primary BUS_PCI device in xf86IsPrimaryPlatform()
>> 2 config: fix GPUDevice fail when AutoAddGPU off + BusID
>> 3 xfree86: Make adding unclaimed devices as GPU devices a separate step
>> 4 xfree86: Try harder to find atleast 1 non GPU Screen
>>
>> You can find the scratch build that I used for testing here:
>>
>>   xorg-x11-server-1.18.4-4.hans_bz1366842_2.fc24
>>   http://koji.fedoraproject.org/koji/taskinfo?taskID=15875087
>>
>> Another reason I used F24's X server as basis, rather than upstream
>> HEAD, is that Fedora 24 is pretty young, and it's already on kernel
>> 4.7.4, and I believe it will soon move to kernel 4.8, without
>> (necessarily) rebasing its X server package to upstream. IOW the kernel
>> upgrade to 4.8 will break X in Fedora 24 too, and then I expect the
>> Fedora X maintainers would have to cherry pick those two patches as
>> dependencies just the same.
>>
>> To summarize, the patches don't seem to help. I shall nonetheless thank
>> you for spending your Friday on this!
> 
> Hmm, do you have a xorg.conf file lying around somewhere, the message
> about the xserver not being able to find an entry for screen 0 does
> not make sense ...

Good catch, I actually had two files under "/etc/X11/xorg.conf.d/":

* "00-keyboard.conf", from package "systemd-229-13.fc24.x86_64", with
contents

------------
# Read and parsed by systemd-localed. It's probably wise not to edit
this file
# manually too freely.
Section "InputClass"
        Identifier "system-keyboard"
        MatchIsKeyboard "on"
        Option "XkbLayout" "us"
EndSection
------------

* "01-resolution.conf", which I had created, in order to set the
preferred display resolution:

------------
Section "Screen"
  Identifier "Default Screen"
  Device     "Default Device"
  Monitor    "Default Monitor"
EndSection

Section "Device"
  Identifier "Default Device"
  Driver     "modesetting"
EndSection

Section "Monitor"
  Identifier "Default Monitor"
  Option     "PreferredMode"   "640x480"
# Option     "PreferredMode"   "1440x900"
EndSection
------------

I removed these files now, and repeated the test. Again, the X server
wouldn't start, but I think the log file looks a bit different now.
Attached.

> Anyways, Monday is another day we can look at this ...

Sure; many thanks!
Laszlo
Hans de Goede Sept. 30, 2016, 4:38 p.m. UTC | #4
Hi,

On 30-09-16 17:33, Laszlo Ersek wrote:
> On 09/30/16 16:59, Hans de Goede wrote:
>> Hi,
>>
>> On 30-09-16 16:51, Laszlo Ersek wrote:
>>> On 09/30/16 12:35, Hans de Goede wrote:
>>>
>>>> Attached are 2 patches against the xserver which should fix this,
>>>> please give them a try.
>>>
>>> Sorry about the delay.
>>>
>>> The patches don't seem to fix the issue for me. Please see the Xorg log
>>> attached.
>>>
>>> I tested the patches as follows. Given that my bisection had been done
>>> in a Fedora 24 guest, using
>>>
>>>   xorg-x11-server-1.18.4-4.fc24
>>>   http://koji.fedoraproject.org/koji/buildinfo?buildID=794494
>>>
>>> I now rebuilt the guest kernel exactly at the failing commit (a325725
>>> "drm: Lobotomize set_busid nonsense for !pci drivers"), and first
>>> reproduced the issue with the above X server.
>>>
>>> Then, I ported your patches to "xorg-server-1.18.4" (using the upstream
>>> xserver tree), and rebuilt the Fedora package with the backport. For the
>>> backport, I had to cherry-pick the following two patches from master
>>> first:
>>>
>>> 1 ca8d88e50310 xfree86: recognize primary BUS_PCI device in
>>>                xf86IsPrimaryPlatform()
>>> 2 ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>
>>> This way your patches applied cleanly. (Cherry pick #1 above is actually
>>> necessary for semantics, while cherry pick #2 is needed for a clean
>>> context only, and has no impact for this test.)
>>>
>>> That is, in total, I added the following four patches to the Fedora 24
>>> package:
>>>
>>> 1 xfree86: recognize primary BUS_PCI device in xf86IsPrimaryPlatform()
>>> 2 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>> 3 xfree86: Make adding unclaimed devices as GPU devices a separate step
>>> 4 xfree86: Try harder to find atleast 1 non GPU Screen
>>>
>>> You can find the scratch build that I used for testing here:
>>>
>>>   xorg-x11-server-1.18.4-4.hans_bz1366842_2.fc24
>>>   http://koji.fedoraproject.org/koji/taskinfo?taskID=15875087
>>>
>>> Another reason I used F24's X server as basis, rather than upstream
>>> HEAD, is that Fedora 24 is pretty young, and it's already on kernel
>>> 4.7.4, and I believe it will soon move to kernel 4.8, without
>>> (necessarily) rebasing its X server package to upstream. IOW the kernel
>>> upgrade to 4.8 will break X in Fedora 24 too, and then I expect the
>>> Fedora X maintainers would have to cherry pick those two patches as
>>> dependencies just the same.
>>>
>>> To summarize, the patches don't seem to help. I shall nonetheless thank
>>> you for spending your Friday on this!
>>
>> Hmm, do you have a xorg.conf file lying around somewhere, the message
>> about the xserver not being able to find an entry for screen 0 does
>> not make sense ...
>
> Good catch, I actually had two files under "/etc/X11/xorg.conf.d/":
>
> * "00-keyboard.conf", from package "systemd-229-13.fc24.x86_64", with
> contents
>
> ------------
> # Read and parsed by systemd-localed. It's probably wise not to edit
> this file
> # manually too freely.
> Section "InputClass"
>         Identifier "system-keyboard"
>         MatchIsKeyboard "on"
>         Option "XkbLayout" "us"
> EndSection
> ------------
>
> * "01-resolution.conf", which I had created, in order to set the
> preferred display resolution:
>
> ------------
> Section "Screen"
>   Identifier "Default Screen"
>   Device     "Default Device"
>   Monitor    "Default Monitor"
> EndSection
>
> Section "Device"
>   Identifier "Default Device"
>   Driver     "modesetting"
> EndSection
>
> Section "Monitor"
>   Identifier "Default Monitor"
>   Option     "PreferredMode"   "640x480"
> # Option     "PreferredMode"   "1440x900"
> EndSection
> ------------
>
> I removed these files now, and repeated the test. Again, the X server
> wouldn't start, but I think the log file looks a bit different now.
> Attached.

Ah, ok so it seems that my initial analysis is wrong, the problem
is not a re-occuring of the device getting identified as a GPU screen,
libdrm sorta depends on bus-ids and the lack of one is causing the
server to misbehave. I guess that even with a xorg.conf things
will fail with the troublesome kernel version (might be worth
trying).

Emil's analysis seems to be spot on. This does not seem easily
fixable in userspace / does seem like a real regression as it
even breaks things when specifying the device through xorg.conf
(I or so I believe) which is something which uses to work ...

I made the mistake of thinking the kernel change was re-triggering
the old problem Laszlo fixed, but that does not seem to be the
case.

Regards,

Hans
Laszlo Ersek Sept. 30, 2016, 5:26 p.m. UTC | #5
On 09/30/16 18:38, Hans de Goede wrote:
> Hi,
> 
> On 30-09-16 17:33, Laszlo Ersek wrote:
>> On 09/30/16 16:59, Hans de Goede wrote:
>>> Hi,
>>>
>>> On 30-09-16 16:51, Laszlo Ersek wrote:
>>>> On 09/30/16 12:35, Hans de Goede wrote:
>>>>
>>>>> Attached are 2 patches against the xserver which should fix this,
>>>>> please give them a try.
>>>>
>>>> Sorry about the delay.
>>>>
>>>> The patches don't seem to fix the issue for me. Please see the Xorg log
>>>> attached.
>>>>
>>>> I tested the patches as follows. Given that my bisection had been done
>>>> in a Fedora 24 guest, using
>>>>
>>>>   xorg-x11-server-1.18.4-4.fc24
>>>>   http://koji.fedoraproject.org/koji/buildinfo?buildID=794494
>>>>
>>>> I now rebuilt the guest kernel exactly at the failing commit (a325725
>>>> "drm: Lobotomize set_busid nonsense for !pci drivers"), and first
>>>> reproduced the issue with the above X server.
>>>>
>>>> Then, I ported your patches to "xorg-server-1.18.4" (using the upstream
>>>> xserver tree), and rebuilt the Fedora package with the backport. For
>>>> the
>>>> backport, I had to cherry-pick the following two patches from master
>>>> first:
>>>>
>>>> 1 ca8d88e50310 xfree86: recognize primary BUS_PCI device in
>>>>                xf86IsPrimaryPlatform()
>>>> 2 ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>>
>>>> This way your patches applied cleanly. (Cherry pick #1 above is
>>>> actually
>>>> necessary for semantics, while cherry pick #2 is needed for a clean
>>>> context only, and has no impact for this test.)
>>>>
>>>> That is, in total, I added the following four patches to the Fedora 24
>>>> package:
>>>>
>>>> 1 xfree86: recognize primary BUS_PCI device in xf86IsPrimaryPlatform()
>>>> 2 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>> 3 xfree86: Make adding unclaimed devices as GPU devices a separate step
>>>> 4 xfree86: Try harder to find atleast 1 non GPU Screen
>>>>
>>>> You can find the scratch build that I used for testing here:
>>>>
>>>>   xorg-x11-server-1.18.4-4.hans_bz1366842_2.fc24
>>>>   http://koji.fedoraproject.org/koji/taskinfo?taskID=15875087
>>>>
>>>> Another reason I used F24's X server as basis, rather than upstream
>>>> HEAD, is that Fedora 24 is pretty young, and it's already on kernel
>>>> 4.7.4, and I believe it will soon move to kernel 4.8, without
>>>> (necessarily) rebasing its X server package to upstream. IOW the kernel
>>>> upgrade to 4.8 will break X in Fedora 24 too, and then I expect the
>>>> Fedora X maintainers would have to cherry pick those two patches as
>>>> dependencies just the same.
>>>>
>>>> To summarize, the patches don't seem to help. I shall nonetheless thank
>>>> you for spending your Friday on this!
>>>
>>> Hmm, do you have a xorg.conf file lying around somewhere, the message
>>> about the xserver not being able to find an entry for screen 0 does
>>> not make sense ...
>>
>> Good catch, I actually had two files under "/etc/X11/xorg.conf.d/":
>>
>> * "00-keyboard.conf", from package "systemd-229-13.fc24.x86_64", with
>> contents
>>
>> ------------
>> # Read and parsed by systemd-localed. It's probably wise not to edit
>> this file
>> # manually too freely.
>> Section "InputClass"
>>         Identifier "system-keyboard"
>>         MatchIsKeyboard "on"
>>         Option "XkbLayout" "us"
>> EndSection
>> ------------
>>
>> * "01-resolution.conf", which I had created, in order to set the
>> preferred display resolution:
>>
>> ------------
>> Section "Screen"
>>   Identifier "Default Screen"
>>   Device     "Default Device"
>>   Monitor    "Default Monitor"
>> EndSection
>>
>> Section "Device"
>>   Identifier "Default Device"
>>   Driver     "modesetting"
>> EndSection
>>
>> Section "Monitor"
>>   Identifier "Default Monitor"
>>   Option     "PreferredMode"   "640x480"
>> # Option     "PreferredMode"   "1440x900"
>> EndSection
>> ------------
>>
>> I removed these files now, and repeated the test. Again, the X server
>> wouldn't start, but I think the log file looks a bit different now.
>> Attached.
> 
> Ah, ok so it seems that my initial analysis is wrong, the problem
> is not a re-occuring of the device getting identified as a GPU screen,
> libdrm sorta depends on bus-ids and the lack of one is causing the
> server to misbehave. I guess that even with a xorg.conf things
> will fail with the troublesome kernel version (might be worth
> trying).
> 
> Emil's analysis seems to be spot on. This does not seem easily
> fixable in userspace / does seem like a real regression as it
> even breaks things when specifying the device through xorg.conf
> (I or so I believe) which is something which uses to work ...

In order to check this hypothesis, I did the following:
- I downgraded my xorg-x11-server installation to the most recent
official F24 packages, that is, "1.18.4-4.fc24",
- I kept the kernel that I built exactly at the regressive commit
(a325725633c2)
- I modified "01-resolution.conf" (see it above in the context) like this:

----
Section "Device"
  Identifier "Default Device"
  Driver     "modesetting"
  BusID      "PCI:00:02:0" <------------ new option added
EndSection
----

where BusID matches the B/D/F of the virtio-vga device from "lspci".

This setup (modulo the kernel of course) was known to work, but now the
X server actually segfaults (apparently in the
xf86PlatformDeviceCheckBusID() function). Please find the logfile attached.

(NB: this is unrelated to upstream commit de9ce6757c2e -- which the
pristine FC24 build lacks -- because I don't set AutoAddGPU to "off" --
it is left at its default "on" value.)

Therefore, you are right. :)

Thanks
Laszlo

> I made the mistake of thinking the kernel change was re-triggering
> the old problem Laszlo fixed, but that does not seem to be the
> case.
> 
> Regards,
> 
> Hans
Emil Velikov Oct. 3, 2016, 11:34 a.m. UTC | #6
On 30 September 2016 at 18:26, Laszlo Ersek <lersek@redhat.com> wrote:
> On 09/30/16 18:38, Hans de Goede wrote:
>> Hi,
>>
>> On 30-09-16 17:33, Laszlo Ersek wrote:
>>> On 09/30/16 16:59, Hans de Goede wrote:
>>>> Hi,
>>>>
>>>> On 30-09-16 16:51, Laszlo Ersek wrote:
>>>>> On 09/30/16 12:35, Hans de Goede wrote:
>>>>>
>>>>>> Attached are 2 patches against the xserver which should fix this,
>>>>>> please give them a try.
>>>>>
>>>>> Sorry about the delay.
>>>>>
>>>>> The patches don't seem to fix the issue for me. Please see the Xorg log
>>>>> attached.
>>>>>
>>>>> I tested the patches as follows. Given that my bisection had been done
>>>>> in a Fedora 24 guest, using
>>>>>
>>>>>   xorg-x11-server-1.18.4-4.fc24
>>>>>   http://koji.fedoraproject.org/koji/buildinfo?buildID=794494
>>>>>
>>>>> I now rebuilt the guest kernel exactly at the failing commit (a325725
>>>>> "drm: Lobotomize set_busid nonsense for !pci drivers"), and first
>>>>> reproduced the issue with the above X server.
>>>>>
>>>>> Then, I ported your patches to "xorg-server-1.18.4" (using the upstream
>>>>> xserver tree), and rebuilt the Fedora package with the backport. For
>>>>> the
>>>>> backport, I had to cherry-pick the following two patches from master
>>>>> first:
>>>>>
>>>>> 1 ca8d88e50310 xfree86: recognize primary BUS_PCI device in
>>>>>                xf86IsPrimaryPlatform()
>>>>> 2 ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>>>
>>>>> This way your patches applied cleanly. (Cherry pick #1 above is
>>>>> actually
>>>>> necessary for semantics, while cherry pick #2 is needed for a clean
>>>>> context only, and has no impact for this test.)
>>>>>
>>>>> That is, in total, I added the following four patches to the Fedora 24
>>>>> package:
>>>>>
>>>>> 1 xfree86: recognize primary BUS_PCI device in xf86IsPrimaryPlatform()
>>>>> 2 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>>> 3 xfree86: Make adding unclaimed devices as GPU devices a separate step
>>>>> 4 xfree86: Try harder to find atleast 1 non GPU Screen
>>>>>
>>>>> You can find the scratch build that I used for testing here:
>>>>>
>>>>>   xorg-x11-server-1.18.4-4.hans_bz1366842_2.fc24
>>>>>   http://koji.fedoraproject.org/koji/taskinfo?taskID=15875087
>>>>>
>>>>> Another reason I used F24's X server as basis, rather than upstream
>>>>> HEAD, is that Fedora 24 is pretty young, and it's already on kernel
>>>>> 4.7.4, and I believe it will soon move to kernel 4.8, without
>>>>> (necessarily) rebasing its X server package to upstream. IOW the kernel
>>>>> upgrade to 4.8 will break X in Fedora 24 too, and then I expect the
>>>>> Fedora X maintainers would have to cherry pick those two patches as
>>>>> dependencies just the same.
>>>>>
>>>>> To summarize, the patches don't seem to help. I shall nonetheless thank
>>>>> you for spending your Friday on this!
>>>>
>>>> Hmm, do you have a xorg.conf file lying around somewhere, the message
>>>> about the xserver not being able to find an entry for screen 0 does
>>>> not make sense ...
>>>
>>> Good catch, I actually had two files under "/etc/X11/xorg.conf.d/":
>>>
>>> * "00-keyboard.conf", from package "systemd-229-13.fc24.x86_64", with
>>> contents
>>>
>>> ------------
>>> # Read and parsed by systemd-localed. It's probably wise not to edit
>>> this file
>>> # manually too freely.
>>> Section "InputClass"
>>>         Identifier "system-keyboard"
>>>         MatchIsKeyboard "on"
>>>         Option "XkbLayout" "us"
>>> EndSection
>>> ------------
>>>
>>> * "01-resolution.conf", which I had created, in order to set the
>>> preferred display resolution:
>>>
>>> ------------
>>> Section "Screen"
>>>   Identifier "Default Screen"
>>>   Device     "Default Device"
>>>   Monitor    "Default Monitor"
>>> EndSection
>>>
>>> Section "Device"
>>>   Identifier "Default Device"
>>>   Driver     "modesetting"
>>> EndSection
>>>
>>> Section "Monitor"
>>>   Identifier "Default Monitor"
>>>   Option     "PreferredMode"   "640x480"
>>> # Option     "PreferredMode"   "1440x900"
>>> EndSection
>>> ------------
>>>
>>> I removed these files now, and repeated the test. Again, the X server
>>> wouldn't start, but I think the log file looks a bit different now.
>>> Attached.
>>
>> Ah, ok so it seems that my initial analysis is wrong, the problem
>> is not a re-occuring of the device getting identified as a GPU screen,
>> libdrm sorta depends on bus-ids and the lack of one is causing the
>> server to misbehave. I guess that even with a xorg.conf things
>> will fail with the troublesome kernel version (might be worth
>> trying).
>>
>> Emil's analysis seems to be spot on. This does not seem easily
>> fixable in userspace / does seem like a real regression as it
>> even breaks things when specifying the device through xorg.conf
>> (I or so I believe) which is something which uses to work ...
>
> In order to check this hypothesis, I did the following:
> - I downgraded my xorg-x11-server installation to the most recent
> official F24 packages, that is, "1.18.4-4.fc24",
> - I kept the kernel that I built exactly at the regressive commit
> (a325725633c2)
> - I modified "01-resolution.conf" (see it above in the context) like this:
>
> ----
> Section "Device"
>   Identifier "Default Device"
>   Driver     "modesetting"
>   BusID      "PCI:00:02:0" <------------ new option added
> EndSection
> ----
>
> where BusID matches the B/D/F of the virtio-vga device from "lspci".
>
> This setup (modulo the kernel of course) was known to work, but now the
> X server actually segfaults (apparently in the
> xf86PlatformDeviceCheckBusID() function). Please find the logfile attached.
>
> (NB: this is unrelated to upstream commit de9ce6757c2e -- which the
> pristine FC24 build lacks -- because I don't set AutoAddGPU to "off" --
> it is left at its default "on" value.)
>
Where is this upstream commit again - it shows as unknown for the
kernel, xserver and libdrm ?

So my theory was a bit off - SetVersion is the one responsible to set
the "BusID", as retrieved by drmGetBusID, regardless if drmOpen or
open is used.

Here's a bit of a brain dump from the other day:

 - The commit mentioned 'affects' the drmSetBusid/DRM_IOCTL_SET_UNIQUE
userspace codepaths.
 - The latter itself is dri1/legacy (xserver hw/xfree86/dri/) which is
not functional for platform devices.
The latter of which seems to be the case for virt-gpu based on the
kernel module.
 - The modesetting driver should/cannot reach the above xserver codepath

That said, it seems that (at least some) userspace expects a PCI
device despite the kernel module 'advertising' itself as platform one
:-\

Going through the xserver layers is a bit inspiring I'm wondering if
we can not get a strace before/after the xserver commit
ca8d88e50310a0d440a127c22a0a383cc149f408 ? It will help us track
things a lot quicker/easier.


Thanks
Emil
Laszlo Ersek Oct. 3, 2016, 12:14 p.m. UTC | #7
On 10/03/16 13:34, Emil Velikov wrote:
> On 30 September 2016 at 18:26, Laszlo Ersek <lersek@redhat.com> wrote:

>> This setup (modulo the kernel of course) was known to work, but now the
>> X server actually segfaults (apparently in the
>> xf86PlatformDeviceCheckBusID() function). Please find the logfile attached.
>>
>> (NB: this is unrelated to upstream commit de9ce6757c2e -- which the
>> pristine FC24 build lacks -- because I don't set AutoAddGPU to "off" --
>> it is left at its default "on" value.)
>>
> Where is this upstream commit again - it shows as unknown for the
> kernel, xserver and libdrm ?

Apologies, that commit hash belongs to a cherry-picked patch on one of
my private branches; the original is

  ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID


> So my theory was a bit off - SetVersion is the one responsible to set
> the "BusID", as retrieved by drmGetBusID, regardless if drmOpen or
> open is used.
> 
> Here's a bit of a brain dump from the other day:
> 
>  - The commit mentioned

(Here I assume you mean kernel commit

  a325725633c2 drm: Lobotomize set_busid nonsense for !pci drivers

again.)

>'affects' the drmSetBusid/DRM_IOCTL_SET_UNIQUE
> userspace codepaths.
>  - The latter itself is dri1/legacy (xserver hw/xfree86/dri/) which is
> not functional for platform devices.
> The latter of which seems to be the case for virt-gpu based on the
> kernel module.
>  - The modesetting driver should/cannot reach the above xserver codepath
> 
> That said, it seems that (at least some) userspace expects a PCI
> device despite the kernel module 'advertising' itself as platform one
> :-\
> 
> Going through the xserver layers is a bit inspiring I'm wondering if
> we can not get a strace before/after the xserver commit
> ca8d88e50310a0d440a127c22a0a383cc149f408 ? It will help us track
> things a lot quicker/easier.

This commit:

  ca8d88e50310 xfree86: recognize primary BUS_PCI device in
               xf86IsPrimaryPlatform()

is not relevant when running Xorg on QEMU's "virtio-vga" device, it is
only relevant when the "virtio-gpu-pci" device is used. Aarch64 guests
can't have "virtio-vga", only "virgio-gpu-pci", so that commit was
needed for Aarch64 guests primarily.

In this case however, kernel commit a325725633c2 breaks "virtio-vga" in
x86_64 guests, both with and without xserver commit ca8d88e50310.

Anyway I'll look into capturing a strace and/or following Xorg with gdb.

Thanks!
Laszlo
Emil Velikov Oct. 3, 2016, 12:46 p.m. UTC | #8
On 3 October 2016 at 13:14, Laszlo Ersek <lersek@redhat.com> wrote:
> On 10/03/16 13:34, Emil Velikov wrote:
>> On 30 September 2016 at 18:26, Laszlo Ersek <lersek@redhat.com> wrote:
>
>>> This setup (modulo the kernel of course) was known to work, but now the
>>> X server actually segfaults (apparently in the
>>> xf86PlatformDeviceCheckBusID() function). Please find the logfile attached.
>>>
>>> (NB: this is unrelated to upstream commit de9ce6757c2e -- which the
>>> pristine FC24 build lacks -- because I don't set AutoAddGPU to "off" --
>>> it is left at its default "on" value.)
>>>
>> Where is this upstream commit again - it shows as unknown for the
>> kernel, xserver and libdrm ?
>
> Apologies, that commit hash belongs to a cherry-picked patch on one of
> my private branches; the original is
>
>   ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID
>
>
>> So my theory was a bit off - SetVersion is the one responsible to set
>> the "BusID", as retrieved by drmGetBusID, regardless if drmOpen or
>> open is used.
>>
>> Here's a bit of a brain dump from the other day:
>>
>>  - The commit mentioned
>
> (Here I assume you mean kernel commit
>
>   a325725633c2 drm: Lobotomize set_busid nonsense for !pci drivers
>
> again.)
>
Correct.

>>'affects' the drmSetBusid/DRM_IOCTL_SET_UNIQUE
>> userspace codepaths.
>>  - The latter itself is dri1/legacy (xserver hw/xfree86/dri/) which is
>> not functional for platform devices.
>> The latter of which seems to be the case for virt-gpu based on the
>> kernel module.
>>  - The modesetting driver should/cannot reach the above xserver codepath
>>
>> That said, it seems that (at least some) userspace expects a PCI
>> device despite the kernel module 'advertising' itself as platform one
>> :-\
>>
>> Going through the xserver layers is a bit inspiring I'm wondering if
>> we can not get a strace before/after the xserver commit
>> ca8d88e50310a0d440a127c22a0a383cc149f408 ? It will help us track
>> things a lot quicker/easier.
>
> This commit:
>
>   ca8d88e50310 xfree86: recognize primary BUS_PCI device in
>                xf86IsPrimaryPlatform()
>
> is not relevant when running Xorg on QEMU's "virtio-vga" device, it is
> only relevant when the "virtio-gpu-pci" device is used. Aarch64 guests
> can't have "virtio-vga", only "virgio-gpu-pci", so that commit was
> needed for Aarch64 guests primarily.
>
> In this case however, kernel commit a325725633c2 breaks "virtio-vga" in
> x86_64 guests, both with and without xserver commit ca8d88e50310.
>
> Anyway I'll look into capturing a strace and/or following Xorg with gdb.
>
If the overall case/issue is irrelevant of the xserver commit we can
ignore if for the time being.
In that case can you please get a xserver strace before/after the
offending kernel commit.

Thank you.
Emil
Laszlo Ersek Oct. 3, 2016, 1:03 p.m. UTC | #9
On 10/03/16 14:46, Emil Velikov wrote:
> On 3 October 2016 at 13:14, Laszlo Ersek <lersek@redhat.com> wrote:
>> On 10/03/16 13:34, Emil Velikov wrote:
>>> On 30 September 2016 at 18:26, Laszlo Ersek <lersek@redhat.com> wrote:
>>
>>>> This setup (modulo the kernel of course) was known to work, but now the
>>>> X server actually segfaults (apparently in the
>>>> xf86PlatformDeviceCheckBusID() function). Please find the logfile attached.
>>>>
>>>> (NB: this is unrelated to upstream commit de9ce6757c2e -- which the
>>>> pristine FC24 build lacks -- because I don't set AutoAddGPU to "off" --
>>>> it is left at its default "on" value.)
>>>>
>>> Where is this upstream commit again - it shows as unknown for the
>>> kernel, xserver and libdrm ?
>>
>> Apologies, that commit hash belongs to a cherry-picked patch on one of
>> my private branches; the original is
>>
>>   ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>
>>
>>> So my theory was a bit off - SetVersion is the one responsible to set
>>> the "BusID", as retrieved by drmGetBusID, regardless if drmOpen or
>>> open is used.
>>>
>>> Here's a bit of a brain dump from the other day:
>>>
>>>  - The commit mentioned
>>
>> (Here I assume you mean kernel commit
>>
>>   a325725633c2 drm: Lobotomize set_busid nonsense for !pci drivers
>>
>> again.)
>>
> Correct.
> 
>>> 'affects' the drmSetBusid/DRM_IOCTL_SET_UNIQUE
>>> userspace codepaths.
>>>  - The latter itself is dri1/legacy (xserver hw/xfree86/dri/) which is
>>> not functional for platform devices.
>>> The latter of which seems to be the case for virt-gpu based on the
>>> kernel module.
>>>  - The modesetting driver should/cannot reach the above xserver codepath
>>>
>>> That said, it seems that (at least some) userspace expects a PCI
>>> device despite the kernel module 'advertising' itself as platform one
>>> :-\
>>>
>>> Going through the xserver layers is a bit inspiring I'm wondering if
>>> we can not get a strace before/after the xserver commit
>>> ca8d88e50310a0d440a127c22a0a383cc149f408 ? It will help us track
>>> things a lot quicker/easier.
>>
>> This commit:
>>
>>   ca8d88e50310 xfree86: recognize primary BUS_PCI device in
>>                xf86IsPrimaryPlatform()
>>
>> is not relevant when running Xorg on QEMU's "virtio-vga" device, it is
>> only relevant when the "virtio-gpu-pci" device is used. Aarch64 guests
>> can't have "virtio-vga", only "virgio-gpu-pci", so that commit was
>> needed for Aarch64 guests primarily.
>>
>> In this case however, kernel commit a325725633c2 breaks "virtio-vga" in
>> x86_64 guests, both with and without xserver commit ca8d88e50310.
>>
>> Anyway I'll look into capturing a strace and/or following Xorg with gdb.
>>
> If the overall case/issue is irrelevant of the xserver commit we can
> ignore if for the time being.
> In that case can you please get a xserver strace before/after the
> offending kernel commit.

Please find them at <http://people.redhat.com/~lersek/set_busid/>.

Thanks!
Laszlo
Laszlo Ersek Oct. 3, 2016, 2:15 p.m. UTC | #10
On 10/03/16 13:34, Emil Velikov wrote:
> On 30 September 2016 at 18:26, Laszlo Ersek <lersek@redhat.com> wrote:
>> On 09/30/16 18:38, Hans de Goede wrote:
>>> Hi,
>>>
>>> On 30-09-16 17:33, Laszlo Ersek wrote:
>>>> On 09/30/16 16:59, Hans de Goede wrote:
>>>>> Hi,
>>>>>
>>>>> On 30-09-16 16:51, Laszlo Ersek wrote:
>>>>>> On 09/30/16 12:35, Hans de Goede wrote:
>>>>>>
>>>>>>> Attached are 2 patches against the xserver which should fix this,
>>>>>>> please give them a try.
>>>>>>
>>>>>> Sorry about the delay.
>>>>>>
>>>>>> The patches don't seem to fix the issue for me. Please see the Xorg log
>>>>>> attached.
>>>>>>
>>>>>> I tested the patches as follows. Given that my bisection had been done
>>>>>> in a Fedora 24 guest, using
>>>>>>
>>>>>>   xorg-x11-server-1.18.4-4.fc24
>>>>>>   http://koji.fedoraproject.org/koji/buildinfo?buildID=794494
>>>>>>
>>>>>> I now rebuilt the guest kernel exactly at the failing commit (a325725
>>>>>> "drm: Lobotomize set_busid nonsense for !pci drivers"), and first
>>>>>> reproduced the issue with the above X server.
>>>>>>
>>>>>> Then, I ported your patches to "xorg-server-1.18.4" (using the upstream
>>>>>> xserver tree), and rebuilt the Fedora package with the backport. For
>>>>>> the
>>>>>> backport, I had to cherry-pick the following two patches from master
>>>>>> first:
>>>>>>
>>>>>> 1 ca8d88e50310 xfree86: recognize primary BUS_PCI device in
>>>>>>                xf86IsPrimaryPlatform()
>>>>>> 2 ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>>>>
>>>>>> This way your patches applied cleanly. (Cherry pick #1 above is
>>>>>> actually
>>>>>> necessary for semantics, while cherry pick #2 is needed for a clean
>>>>>> context only, and has no impact for this test.)
>>>>>>
>>>>>> That is, in total, I added the following four patches to the Fedora 24
>>>>>> package:
>>>>>>
>>>>>> 1 xfree86: recognize primary BUS_PCI device in xf86IsPrimaryPlatform()
>>>>>> 2 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>>>> 3 xfree86: Make adding unclaimed devices as GPU devices a separate step
>>>>>> 4 xfree86: Try harder to find atleast 1 non GPU Screen
>>>>>>
>>>>>> You can find the scratch build that I used for testing here:
>>>>>>
>>>>>>   xorg-x11-server-1.18.4-4.hans_bz1366842_2.fc24
>>>>>>   http://koji.fedoraproject.org/koji/taskinfo?taskID=15875087
>>>>>>
>>>>>> Another reason I used F24's X server as basis, rather than upstream
>>>>>> HEAD, is that Fedora 24 is pretty young, and it's already on kernel
>>>>>> 4.7.4, and I believe it will soon move to kernel 4.8, without
>>>>>> (necessarily) rebasing its X server package to upstream. IOW the kernel
>>>>>> upgrade to 4.8 will break X in Fedora 24 too, and then I expect the
>>>>>> Fedora X maintainers would have to cherry pick those two patches as
>>>>>> dependencies just the same.
>>>>>>
>>>>>> To summarize, the patches don't seem to help. I shall nonetheless thank
>>>>>> you for spending your Friday on this!
>>>>>
>>>>> Hmm, do you have a xorg.conf file lying around somewhere, the message
>>>>> about the xserver not being able to find an entry for screen 0 does
>>>>> not make sense ...
>>>>
>>>> Good catch, I actually had two files under "/etc/X11/xorg.conf.d/":
>>>>
>>>> * "00-keyboard.conf", from package "systemd-229-13.fc24.x86_64", with
>>>> contents
>>>>
>>>> ------------
>>>> # Read and parsed by systemd-localed. It's probably wise not to edit
>>>> this file
>>>> # manually too freely.
>>>> Section "InputClass"
>>>>         Identifier "system-keyboard"
>>>>         MatchIsKeyboard "on"
>>>>         Option "XkbLayout" "us"
>>>> EndSection
>>>> ------------
>>>>
>>>> * "01-resolution.conf", which I had created, in order to set the
>>>> preferred display resolution:
>>>>
>>>> ------------
>>>> Section "Screen"
>>>>   Identifier "Default Screen"
>>>>   Device     "Default Device"
>>>>   Monitor    "Default Monitor"
>>>> EndSection
>>>>
>>>> Section "Device"
>>>>   Identifier "Default Device"
>>>>   Driver     "modesetting"
>>>> EndSection
>>>>
>>>> Section "Monitor"
>>>>   Identifier "Default Monitor"
>>>>   Option     "PreferredMode"   "640x480"
>>>> # Option     "PreferredMode"   "1440x900"
>>>> EndSection
>>>> ------------
>>>>
>>>> I removed these files now, and repeated the test. Again, the X server
>>>> wouldn't start, but I think the log file looks a bit different now.
>>>> Attached.
>>>
>>> Ah, ok so it seems that my initial analysis is wrong, the problem
>>> is not a re-occuring of the device getting identified as a GPU screen,
>>> libdrm sorta depends on bus-ids and the lack of one is causing the
>>> server to misbehave. I guess that even with a xorg.conf things
>>> will fail with the troublesome kernel version (might be worth
>>> trying).
>>>
>>> Emil's analysis seems to be spot on. This does not seem easily
>>> fixable in userspace / does seem like a real regression as it
>>> even breaks things when specifying the device through xorg.conf
>>> (I or so I believe) which is something which uses to work ...
>>
>> In order to check this hypothesis, I did the following:
>> - I downgraded my xorg-x11-server installation to the most recent
>> official F24 packages, that is, "1.18.4-4.fc24",
>> - I kept the kernel that I built exactly at the regressive commit
>> (a325725633c2)
>> - I modified "01-resolution.conf" (see it above in the context) like this:
>>
>> ----
>> Section "Device"
>>   Identifier "Default Device"
>>   Driver     "modesetting"
>>   BusID      "PCI:00:02:0" <------------ new option added
>> EndSection
>> ----
>>
>> where BusID matches the B/D/F of the virtio-vga device from "lspci".
>>
>> This setup (modulo the kernel of course) was known to work, but now the
>> X server actually segfaults (apparently in the
>> xf86PlatformDeviceCheckBusID() function). Please find the logfile attached.
>>
>> (NB: this is unrelated to upstream commit de9ce6757c2e -- which the
>> pristine FC24 build lacks -- because I don't set AutoAddGPU to "off" --
>> it is left at its default "on" value.)
>>
> Where is this upstream commit again - it shows as unknown for the
> kernel, xserver and libdrm ?
> 
> So my theory was a bit off - SetVersion is the one responsible to set
> the "BusID", as retrieved by drmGetBusID, regardless if drmOpen or
> open is used.
> 
> Here's a bit of a brain dump from the other day:
> 
>  - The commit mentioned 'affects' the drmSetBusid/DRM_IOCTL_SET_UNIQUE
> userspace codepaths.
>  - The latter itself is dri1/legacy (xserver hw/xfree86/dri/) which is
> not functional for platform devices.
> The latter of which seems to be the case for virt-gpu based on the
> kernel module.
>  - The modesetting driver should/cannot reach the above xserver codepath
> 
> That said, it seems that (at least some) userspace expects a PCI
> device despite the kernel module 'advertising' itself as platform one
> :-\

With kernel commit a325725633c2 applied, the drmGetBusid() call in
get_drm_info() [hw/xfree86/os-support/linux/lnx_platform.c] returns
"virtio0".

Without kernel commit a325725633c2 in place, the same function call
produces "pci:0000:00:02.0".

I think that the idea of kernel commit a325725633c2:

    We already have a fallback in place to fill out the unique from
    dev->unique, which is set to something reasonable in drm_dev_alloc.

is inappropriate for virtio devices.

While it is true that "virtioN" is unique across the guest system, those
identifiers are not stable.

# ls -1d /sys/devices/pci*/*/virtio*
/sys/devices/pci0000:00/0000:00:02.0/virtio0
/sys/devices/pci0000:00/0000:00:03.0/virtio1
/sys/devices/pci0000:00/0000:00:05.0/virtio2
/sys/devices/pci0000:00/0000:00:06.0/virtio3
/sys/devices/pci0000:00/0000:00:09.0/virtio4

If you tweak the PCI addresses of other virtio devices on the QEMU
command line, while keeping the PCI address of the virtio-vga device
intact, the "virtioN" identifier of the virtio-vga device may change in
an unspecified / unexpected manner.

From the above listing, it seems like higher PCI Device numbers happen
to end up with higher "virtioN" identifiers. While this is unspecified /
non-guaranteed behavior, in practice it means the following:

Moving the "virtio1" device from PCI address 00:03.0 to 00:01.0, while
preserving the PCI address of virtio-vga as 00:02.0, will bump
virtio-vga to "virtio1" from "virtio0" (and sink that other device from
"virtio1" to "virtio0").

I think that unstable identifiers don't qualify for BusID use,
regardless of the actual format of the IDs in question.

Searching the patch quickly, drm_virtio_set_busid() is the only
implementation of the "drm_driver.set_busid" callback that delegates the
task to drm_pci_set_busid(). All other implementations of this callback
were provided by drm_platform_set_busid().

drm_platform_set_busid() would ultimately format
"dev->platformdev->name" and "dev->platformdev->id" into the bus
identifier. Replacing that common logic with the drm_dev_alloc()
fallback that is already in place is probably justified: judging from
"virtio0", the fallback likely captures the exact same information, just
with a different format.

However, drm_virtio_set_busid() used to implement a different logic
(encoding different information). It intentionally used stable PCI
addresses rather than (platformdev->name, platformdev->id).

So, I think that removing drm_virtio_set_busid() was an actual
regression. I propose that the related hunks of a325725633c2 be
reverted. For virtio-vga and virtio-gpu-pci, "virtioN" is an unstable
and inappropraite BusID pattern.

Thanks
Laszlo
Laszlo Ersek Oct. 3, 2016, 2:27 p.m. UTC | #11
On 10/03/16 16:15, Laszlo Ersek wrote:
> On 10/03/16 13:34, Emil Velikov wrote:
>> On 30 September 2016 at 18:26, Laszlo Ersek <lersek@redhat.com> wrote:
>>> On 09/30/16 18:38, Hans de Goede wrote:
>>>> Hi,
>>>>
>>>> On 30-09-16 17:33, Laszlo Ersek wrote:
>>>>> On 09/30/16 16:59, Hans de Goede wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 30-09-16 16:51, Laszlo Ersek wrote:
>>>>>>> On 09/30/16 12:35, Hans de Goede wrote:
>>>>>>>
>>>>>>>> Attached are 2 patches against the xserver which should fix this,
>>>>>>>> please give them a try.
>>>>>>>
>>>>>>> Sorry about the delay.
>>>>>>>
>>>>>>> The patches don't seem to fix the issue for me. Please see the Xorg log
>>>>>>> attached.
>>>>>>>
>>>>>>> I tested the patches as follows. Given that my bisection had been done
>>>>>>> in a Fedora 24 guest, using
>>>>>>>
>>>>>>>   xorg-x11-server-1.18.4-4.fc24
>>>>>>>   http://koji.fedoraproject.org/koji/buildinfo?buildID=794494
>>>>>>>
>>>>>>> I now rebuilt the guest kernel exactly at the failing commit (a325725
>>>>>>> "drm: Lobotomize set_busid nonsense for !pci drivers"), and first
>>>>>>> reproduced the issue with the above X server.
>>>>>>>
>>>>>>> Then, I ported your patches to "xorg-server-1.18.4" (using the upstream
>>>>>>> xserver tree), and rebuilt the Fedora package with the backport. For
>>>>>>> the
>>>>>>> backport, I had to cherry-pick the following two patches from master
>>>>>>> first:
>>>>>>>
>>>>>>> 1 ca8d88e50310 xfree86: recognize primary BUS_PCI device in
>>>>>>>                xf86IsPrimaryPlatform()
>>>>>>> 2 ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>>>>>
>>>>>>> This way your patches applied cleanly. (Cherry pick #1 above is
>>>>>>> actually
>>>>>>> necessary for semantics, while cherry pick #2 is needed for a clean
>>>>>>> context only, and has no impact for this test.)
>>>>>>>
>>>>>>> That is, in total, I added the following four patches to the Fedora 24
>>>>>>> package:
>>>>>>>
>>>>>>> 1 xfree86: recognize primary BUS_PCI device in xf86IsPrimaryPlatform()
>>>>>>> 2 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>>>>> 3 xfree86: Make adding unclaimed devices as GPU devices a separate step
>>>>>>> 4 xfree86: Try harder to find atleast 1 non GPU Screen
>>>>>>>
>>>>>>> You can find the scratch build that I used for testing here:
>>>>>>>
>>>>>>>   xorg-x11-server-1.18.4-4.hans_bz1366842_2.fc24
>>>>>>>   http://koji.fedoraproject.org/koji/taskinfo?taskID=15875087
>>>>>>>
>>>>>>> Another reason I used F24's X server as basis, rather than upstream
>>>>>>> HEAD, is that Fedora 24 is pretty young, and it's already on kernel
>>>>>>> 4.7.4, and I believe it will soon move to kernel 4.8, without
>>>>>>> (necessarily) rebasing its X server package to upstream. IOW the kernel
>>>>>>> upgrade to 4.8 will break X in Fedora 24 too, and then I expect the
>>>>>>> Fedora X maintainers would have to cherry pick those two patches as
>>>>>>> dependencies just the same.
>>>>>>>
>>>>>>> To summarize, the patches don't seem to help. I shall nonetheless thank
>>>>>>> you for spending your Friday on this!
>>>>>>
>>>>>> Hmm, do you have a xorg.conf file lying around somewhere, the message
>>>>>> about the xserver not being able to find an entry for screen 0 does
>>>>>> not make sense ...
>>>>>
>>>>> Good catch, I actually had two files under "/etc/X11/xorg.conf.d/":
>>>>>
>>>>> * "00-keyboard.conf", from package "systemd-229-13.fc24.x86_64", with
>>>>> contents
>>>>>
>>>>> ------------
>>>>> # Read and parsed by systemd-localed. It's probably wise not to edit
>>>>> this file
>>>>> # manually too freely.
>>>>> Section "InputClass"
>>>>>         Identifier "system-keyboard"
>>>>>         MatchIsKeyboard "on"
>>>>>         Option "XkbLayout" "us"
>>>>> EndSection
>>>>> ------------
>>>>>
>>>>> * "01-resolution.conf", which I had created, in order to set the
>>>>> preferred display resolution:
>>>>>
>>>>> ------------
>>>>> Section "Screen"
>>>>>   Identifier "Default Screen"
>>>>>   Device     "Default Device"
>>>>>   Monitor    "Default Monitor"
>>>>> EndSection
>>>>>
>>>>> Section "Device"
>>>>>   Identifier "Default Device"
>>>>>   Driver     "modesetting"
>>>>> EndSection
>>>>>
>>>>> Section "Monitor"
>>>>>   Identifier "Default Monitor"
>>>>>   Option     "PreferredMode"   "640x480"
>>>>> # Option     "PreferredMode"   "1440x900"
>>>>> EndSection
>>>>> ------------
>>>>>
>>>>> I removed these files now, and repeated the test. Again, the X server
>>>>> wouldn't start, but I think the log file looks a bit different now.
>>>>> Attached.
>>>>
>>>> Ah, ok so it seems that my initial analysis is wrong, the problem
>>>> is not a re-occuring of the device getting identified as a GPU screen,
>>>> libdrm sorta depends on bus-ids and the lack of one is causing the
>>>> server to misbehave. I guess that even with a xorg.conf things
>>>> will fail with the troublesome kernel version (might be worth
>>>> trying).
>>>>
>>>> Emil's analysis seems to be spot on. This does not seem easily
>>>> fixable in userspace / does seem like a real regression as it
>>>> even breaks things when specifying the device through xorg.conf
>>>> (I or so I believe) which is something which uses to work ...
>>>
>>> In order to check this hypothesis, I did the following:
>>> - I downgraded my xorg-x11-server installation to the most recent
>>> official F24 packages, that is, "1.18.4-4.fc24",
>>> - I kept the kernel that I built exactly at the regressive commit
>>> (a325725633c2)
>>> - I modified "01-resolution.conf" (see it above in the context) like this:
>>>
>>> ----
>>> Section "Device"
>>>   Identifier "Default Device"
>>>   Driver     "modesetting"
>>>   BusID      "PCI:00:02:0" <------------ new option added
>>> EndSection
>>> ----
>>>
>>> where BusID matches the B/D/F of the virtio-vga device from "lspci".
>>>
>>> This setup (modulo the kernel of course) was known to work, but now the
>>> X server actually segfaults (apparently in the
>>> xf86PlatformDeviceCheckBusID() function). Please find the logfile attached.
>>>
>>> (NB: this is unrelated to upstream commit de9ce6757c2e -- which the
>>> pristine FC24 build lacks -- because I don't set AutoAddGPU to "off" --
>>> it is left at its default "on" value.)
>>>
>> Where is this upstream commit again - it shows as unknown for the
>> kernel, xserver and libdrm ?
>>
>> So my theory was a bit off - SetVersion is the one responsible to set
>> the "BusID", as retrieved by drmGetBusID, regardless if drmOpen or
>> open is used.
>>
>> Here's a bit of a brain dump from the other day:
>>
>>  - The commit mentioned 'affects' the drmSetBusid/DRM_IOCTL_SET_UNIQUE
>> userspace codepaths.
>>  - The latter itself is dri1/legacy (xserver hw/xfree86/dri/) which is
>> not functional for platform devices.
>> The latter of which seems to be the case for virt-gpu based on the
>> kernel module.
>>  - The modesetting driver should/cannot reach the above xserver codepath
>>
>> That said, it seems that (at least some) userspace expects a PCI
>> device despite the kernel module 'advertising' itself as platform one
>> :-\
> 
> With kernel commit a325725633c2 applied, the drmGetBusid() call in
> get_drm_info() [hw/xfree86/os-support/linux/lnx_platform.c] returns
> "virtio0".
> 
> Without kernel commit a325725633c2 in place, the same function call
> produces "pci:0000:00:02.0".
> 
> I think that the idea of kernel commit a325725633c2:
> 
>     We already have a fallback in place to fill out the unique from
>     dev->unique, which is set to something reasonable in drm_dev_alloc.
> 
> is inappropriate for virtio devices.
> 
> While it is true that "virtioN" is unique across the guest system, those
> identifiers are not stable.
> 
> # ls -1d /sys/devices/pci*/*/virtio*
> /sys/devices/pci0000:00/0000:00:02.0/virtio0
> /sys/devices/pci0000:00/0000:00:03.0/virtio1
> /sys/devices/pci0000:00/0000:00:05.0/virtio2
> /sys/devices/pci0000:00/0000:00:06.0/virtio3
> /sys/devices/pci0000:00/0000:00:09.0/virtio4
> 
> If you tweak the PCI addresses of other virtio devices on the QEMU
> command line, while keeping the PCI address of the virtio-vga device
> intact, the "virtioN" identifier of the virtio-vga device may change in
> an unspecified / unexpected manner.
> 
> From the above listing, it seems like higher PCI Device numbers happen
> to end up with higher "virtioN" identifiers. While this is unspecified /
> non-guaranteed behavior, in practice it means the following:
> 
> Moving the "virtio1" device from PCI address 00:03.0 to 00:01.0, while
> preserving the PCI address of virtio-vga as 00:02.0, will bump
> virtio-vga to "virtio1" from "virtio0" (and sink that other device from
> "virtio1" to "virtio0").
> 
> I think that unstable identifiers don't qualify for BusID use,
> regardless of the actual format of the IDs in question.
> 
> Searching the patch quickly, drm_virtio_set_busid() is the only
> implementation of the "drm_driver.set_busid" callback that delegates the
> task to drm_pci_set_busid(). All other implementations of this callback
> were provided by drm_platform_set_busid().
> 
> drm_platform_set_busid() would ultimately format
> "dev->platformdev->name" and "dev->platformdev->id" into the bus
> identifier. Replacing that common logic with the drm_dev_alloc()
> fallback that is already in place is probably justified: judging from
> "virtio0", the fallback likely captures the exact same information, just
> with a different format.
> 
> However, drm_virtio_set_busid() used to implement a different logic
> (encoding different information). It intentionally used stable PCI
> addresses rather than (platformdev->name, platformdev->id).
> 
> So, I think that removing drm_virtio_set_busid() was an actual
> regression. I propose that the related hunks of a325725633c2 be
> reverted. For virtio-vga and virtio-gpu-pci, "virtioN" is an unstable
> and inappropraite BusID pattern.

In support of my argument, please see the following remark in commit
message:

    v4: Drop accidental amdgpu hunk (Emil).

Now, if you look up the v3 posting for that amdgpu hunk:

  https://lists.freedesktop.org/archives/dri-devel/2016-June/111200.html
  https://lists.freedesktop.org/archives/dri-devel/2016-June/111486.html

the v3 patch accidentally removed the similarly customized set_busid
callback for amdgpu -- drm_pci_set_busid(). Emil caught that error in
review, hence the v4 patch wouldn't contain the same error.

My argument is exactly the same -- just as the amdgpu hunk shouldn't be
in the patch, the drm_virtio_set_busid() hunk shouldn't be there either.
Both amdgpu and virtio-gpu implement a set_busid callback -- by
delegating to drm_pci_set_busid() -- that *cannot* be replaced with the
fallback in drm_dev_alloc().

Please revert the drm_virtio_set_busid() part of kernel commit
a325725633c2; at this point I'm 95% sure it was an oversight that didn't
get caught in review.

The rest of kernel commit a325725633c2 looks sane to me; while it
changes the formatting of the busid for the affected platform devices,
relying on the fallback in those cases actually preserves the same
information content. (This is not true for amdgpu and virtio-gpu.)

Thanks
Laszlo
Emil Velikov Oct. 3, 2016, 3 p.m. UTC | #12
On 3 October 2016 at 15:27, Laszlo Ersek <lersek@redhat.com> wrote:
[snip]
> the v3 patch accidentally removed the similarly customized set_busid
> callback for amdgpu -- drm_pci_set_busid(). Emil caught that error in
> review, hence the v4 patch wouldn't contain the same error.
>
You're spot on - virtio-gpu doesn't use drm_*platform*_set_busid,
despite that I've misread it as such for over a dozen times. So yes,
that hunk should not have been removed.

Lacking a git checkout on this system, so if anyone beats me to it and
sends a patch that'll be great.

Emil
Laszlo Ersek Oct. 3, 2016, 3:19 p.m. UTC | #13
On 10/03/16 17:00, Emil Velikov wrote:
> On 3 October 2016 at 15:27, Laszlo Ersek <lersek@redhat.com> wrote:
> [snip]
>> the v3 patch accidentally removed the similarly customized set_busid
>> callback for amdgpu -- drm_pci_set_busid(). Emil caught that error in
>> review, hence the v4 patch wouldn't contain the same error.
>>
> You're spot on - virtio-gpu doesn't use drm_*platform*_set_busid,
> despite that I've misread it as such for over a dozen times. So yes,
> that hunk should not have been removed.
> 
> Lacking a git checkout on this system, so if anyone beats me to it and
> sends a patch that'll be great.

I'll send a patch ASAP.

Thank you
Laszlo
diff mbox

Patch

From 3241fcf1636e1b1cb8e0fdd698a2d2e00d1bbcc0 Mon Sep 17 00:00:00 2001
From: Hans de Goede <hdegoede@redhat.com>
Date: Fri, 30 Sep 2016 12:29:09 +0200
Subject: [PATCH 2/2] xfree86: Try harder to find atleast 1 non GPU Screen

If we did not find any non GPU Screens, try again ignoring the notion
of any video devices being the primary device. This fixes Xorg exiting
with a "no screens found" error when using virtio-vga in a
virtual-machine and when using a device driven by simpledrm.

This is a somewhat ugly solution, but it is the best I can come up with
without major surgery to the bus and probe code.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
---
 hw/xfree86/common/xf86.h            |  1 +
 hw/xfree86/common/xf86Bus.c         | 26 +++++++++++++++++++++++---
 hw/xfree86/common/xf86Globals.c     |  1 +
 hw/xfree86/common/xf86pciBus.c      |  4 ++++
 hw/xfree86/common/xf86platformBus.c |  4 ++++
 5 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/hw/xfree86/common/xf86.h b/hw/xfree86/common/xf86.h
index e54c811..f724688 100644
--- a/hw/xfree86/common/xf86.h
+++ b/hw/xfree86/common/xf86.h
@@ -55,6 +55,7 @@ 
 extern _X_EXPORT int xf86DoConfigure;
 extern _X_EXPORT int xf86DoShowOptions;
 extern _X_EXPORT Bool xf86DoConfigurePass1;
+extern _X_EXPORT Bool xf86ProbeIgnorePrimary;
 extern _X_EXPORT Bool xorgHWAccess;
 
 extern _X_EXPORT DevPrivateKeyRec xf86ScreenKeyRec;
diff --git a/hw/xfree86/common/xf86Bus.c b/hw/xfree86/common/xf86Bus.c
index a3a9898..9836803 100644
--- a/hw/xfree86/common/xf86Bus.c
+++ b/hw/xfree86/common/xf86Bus.c
@@ -117,14 +117,34 @@  xf86BusConfig(void)
     int i, j;
 
     /*
-     * Now call each of the Probe functions.  Each successful probe will
-     * result in an extra entry added to the xf86Screens[] list for each
-     * instance of the hardware found.
+     * 3 step probe to (hopefully) ensure that we always find at least 1
+     * (non GPU) screen:
+     *
+     * 1. Call each drivers probe function normally,
+     *    Each successful probe will result in an extra entry added to the
+     *    xf86Screens[] list for each instance of the hardware found.
      */
     for (i = 0; i < xf86NumDrivers; i++) {
         xf86CallDriverProbe(xf86DriverList[i], FALSE);
     }
 
+    /*
+     * 2. If no Screens were found, call each drivers probe function with
+     *    ignorePrimary = TRUE, to ensure that we do actually get a
+     *    Screen if there is atleast one supported video card.
+     */
+    if (xf86NumScreens == 0) {
+        xf86ProbeIgnorePrimary = TRUE;
+        for (i = 0; i < xf86NumDrivers && xf86NumScreens == 0; i++) {
+            xf86CallDriverProbe(xf86DriverList[i], FALSE);
+        }
+        xf86ProbeIgnorePrimary = FALSE;
+    }
+
+    /*
+     * 3. Call xf86platformAddGPUDevices() to add any additional video cards as
+     *    GPUScreens (GPUScreens are only supported by platformBus drivers).
+     */
     for (i = 0; i < xf86NumDrivers; i++) {
         xf86platformAddGPUDevices(xf86DriverList[i]);
     }
diff --git a/hw/xfree86/common/xf86Globals.c b/hw/xfree86/common/xf86Globals.c
index 072c3fc..0d1e31b 100644
--- a/hw/xfree86/common/xf86Globals.c
+++ b/hw/xfree86/common/xf86Globals.c
@@ -153,6 +153,7 @@  XF86ConfigPtr xf86configptr = NULL;
 Bool xf86Resetting = FALSE;
 Bool xf86Initialising = FALSE;
 Bool xf86DoConfigure = FALSE;
+Bool xf86ProbeIgnorePrimary = FALSE;
 Bool xf86DoShowOptions = FALSE;
 DriverPtr *xf86DriverList = NULL;
 int xf86NumDrivers = 0;
diff --git a/hw/xfree86/common/xf86pciBus.c b/hw/xfree86/common/xf86pciBus.c
index 8158c2b..9adfee5 100644
--- a/hw/xfree86/common/xf86pciBus.c
+++ b/hw/xfree86/common/xf86pciBus.c
@@ -352,6 +352,10 @@  xf86ComparePciBusString(const char *busID, int bus, int device, int func)
 Bool
 xf86IsPrimaryPci(struct pci_device *pPci)
 {
+    /* Add max. 1 screen for the IgnorePrimary fallback path */
+    if (xf86ProbeIgnorePrimary && xf86NumScreens == 0)
+        return TRUE;
+
     if (primaryBus.type == BUS_PCI)
         return pPci == primaryBus.id.pci;
 #ifdef XSERVER_PLATFORM_BUS
diff --git a/hw/xfree86/common/xf86platformBus.c b/hw/xfree86/common/xf86platformBus.c
index ec2a1dd..fde6796 100644
--- a/hw/xfree86/common/xf86platformBus.c
+++ b/hw/xfree86/common/xf86platformBus.c
@@ -115,6 +115,10 @@  xf86_find_platform_device_by_devnum(int major, int minor)
 static Bool
 xf86IsPrimaryPlatform(struct xf86_platform_device *plat)
 {
+    /* Add max. 1 screen for the IgnorePrimary fallback path */
+    if (xf86ProbeIgnorePrimary && xf86NumScreens == 0)
+        return TRUE;
+
     if (primaryBus.type == BUS_PLATFORM)
         return plat == primaryBus.id.plat;
 #ifdef XSERVER_LIBPCIACCESS
-- 
2.9.3