[RFC,v2,0/7] QOM: Singleton interface

Message ID	20241029211607.2114845-1-peterx@redhat.com (mailing list archive)
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> From: Peter Xu <peterx@redhat.com> To: qemu-devel@nongnu.org Cc: Markus Armbruster <armbru@redhat.com>, Peter Maydell <peter.maydell@linaro.org>, Eduardo Habkost <eduardo@habkost.net>, Paolo Bonzini <pbonzini@redhat.com>, Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>, Igor Mammedov <imammedo@redhat.com>, "Michael S . Tsirkin" <mst@redhat.com>, Alex Williamson <alex.williamson@redhat.com>, "Dr . David Alan Gilbert" <dave@treblig.org>, peterx@redhat.com, =?utf-8?q?Daniel_P_=2E_Berrang=C3=A9?= <berrange@redhat.com>, =?utf-8?q?Phi?= =?utf-8?q?lippe_Mathieu-Daud=C3=A9?= <philmd@linaro.org>, =?utf-8?q?C=C3=A9?= =?utf-8?q?dric_Le_Goater?= <clg@redhat.com>, Fabiano Rosas <farosas@suse.de>, Juraj Marcin <jmarcin@redhat.com> Subject: [PATCH RFC v2 0/7] QOM: Singleton interface Date: Tue, 29 Oct 2024 17:16:00 -0400 Message-ID: <20241029211607.2114845-1-peterx@redhat.com> Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=170.10.133.124; envelope-from=peterx@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -23 X-Spam_score: -2.4 X-Spam_bar: -- X-Spam_report: (-2.4 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.302, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001, RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Series	QOM: Singleton interface \| expand [RFC,v2,0/7] QOM: Singleton interface [RFC,v2,1/7] qom: Track dynamic initiations of random object class [RFC,v2,2/7] qom: TYPE_SINGLETON interface [RFC,v2,3/7] qdev: Make device_set_realized() be fully prepared with !machine [RFC,v2,4/7] qdev: Make qdev_get_machine() safe before machine creates [RFC,v2,5/7] x86/iommu: Make x86-iommu a singleton object [RFC,v2,6/7] migration: Make migration object a singleton object [RFC,v2,7/7] migration: Reset current_migration properly

Peter Xu Oct. 29, 2024, 9:16 p.m. UTC

v1: https://lore.kernel.org/r/20241024165627.1372621-1-peterx@redhat.com

This patchset introduces the singleton interface for QOM.  I didn't add a
changelog because there're quite a few changes here and there, plus new
patches.  So it might just be easier to re-read, considering the patchset
isn't large.

I switched v2 into RFC, because we have reviewer concerns (Phil and Dan so
far) that it could be error prone to try to trap every attempts to create
an object.  My argument is, if we already have abstract class, meanwhile we
do not allow instantiation of abstract class, so the complexity is already
there.  I prepared patch 1 this time to collect and track all similar
random object creations; it might be helpful as a cleanup on its own to
deduplicate some similar error messages.  Said that, I'm still always open
to rejections to this proposal.

I hope v2 looks slightly cleaner by having not only object_new_allowed()
but also object_new_or_fetch().

Patch layout:

Patch 1-2:        The patches to introduce QOM singleton interface
Patch 3-5:        Add support for vIOMMU singleton, some qdev change needed
Patch 6-7:        Add support for migration singleton, fix dangle pointer

Below are copy-paste of the commit message of patch 2, that I could have
put into the cover letter too.

====8<====

The singleton interface declares an object class which can only create one
instance globally.

Backgrounds / Use Cases
=======================

There can be some existing classes that can start to benefit from it.  One
example is vIOMMU implementations.

QEMU emulated vIOMMUs are normally implemented on top of a generic device,
however it's special enough to normally only allow one instance of it for
the whole system, attached to the root complex.

These vIOMMU classes can be singletons in this case, so that QEMU can fail
or detect yet another attempt of creating such devices for more than once,
which can be fatal errors to a system.

We used to have some special code guarding it from happening.  In x86,
pc_machine_device_pre_plug_cb() has code to detect when vIOMMU is created
more than once, for instance.  With singleton class, logically we could
consider dropping the special code, but start to rely on QOM to make sure
there's only one vIOMMU for the whole system emulation.

There is a similar demand raising recently (even if the problem existed
over years) in migration.

Firstly, the migration object can currently be created randomly, even
though not wanted, e.g. during qom-list-properties QMP commands.  Ideally,
there can be cases where we want to have an object walking over the
properties, we could use the existing migration object instead of
dynamically create one.

Meanwhile, migration has a long standing issue on current_migration
pointer, where it can point to freed data after the migration object is
finalized.  It is debatable that the pointer can be cleared after the main
thread (1) join() the migration thread first, then (2) release the last
refcount for the migration object and clear the pointer.  However there's
still major challenges [1].  With singleton, we could have a slightly but
hopefully working workaround to clear the pointer during finalize().

Design
======

The idea behind is pretty simple: any object that can only be created once
can now declare the TYPE_SINGLETON interface. Then, QOM facilities will
make sure it won't be created more than once for the whole QEMU lifecycle.
Whenever possible (e.g., on object_new_allowed() checks), pretty error
message will be generated to report an error.  QOM also guards at the core
of object_new() so that any further violation of trying to create a
singleton more than once will crash QEMU as a programming error.

For example, qom-list-properties, device-list-properties, etc., will be
smart enough to not try to create temporary singleton objects if the class
is a singleton class and if there's existing instance created.  Such usages
should be rare, and this patch introduced object_new_or_fetch() just for
it, which either create a new temp object when available, or fetch the
instance if we found an existing singleton instance.  There're only two
such use cases.

Meanwhile, we also guard device-add or similar paths using the singleton
check in object_new_allowed(), so that it'll fail properly if a singleton
class instantiate more than one object.

Glib Singleton implementation
=============================

One note here to mention the Glib implementation of singletons [1].

QEMU chose not to follow Glib's implementation because Glib's version is
not thread safe on the constructor, so that two concurrent g_object_new()
on a single can race.  It's not ideal to QEMU, as QEMU has to not only
support the event-driven context which is normally lock-free, but also
the case where threads are heavily used.

It could be QEMU's desire to properly support multi-threading by default on
such new interface.  The "bad" side effect of that is, QEMU's object_new()
on singletons can assert failures if the singleton existed, but that's also
part of the design so as to forbid such from happening, taking which as a
programming error.  Meanwhile since we have pretty ways to fail elsewhere
on qdev creations, it should already guard us in a clean way, from anywhere
that the user could try to create the singleton more than once.

The current QEMU impl also guarantees object_new() always return a newly
allocated object as long as properly returned, rather than silently return
an existing object as what Glib's impl would do.  I see it a benefit, so as
to avoid unknown caller manipulate a global object, wrongly assuming it was
temporarily created.

[1] https://lore.kernel.org/qemu-devel/20190228122822.GD4970@work-vm/
[2] https://lore.kernel.org/r/ZxtqGQbd4Hq4APtm@redhat.com

Thanks,

Peter Xu (7):
  qom: Track dynamic initiations of random object class
  qom: TYPE_SINGLETON interface
  qdev: Make device_set_realized() be fully prepared with !machine
  qdev: Make qdev_get_machine() safe before machine creates
  x86/iommu: Make x86-iommu a singleton object
  migration: Make migration object a singleton object
  migration: Reset current_migration properly

 qapi/qdev.json                  |  2 +-
 qapi/qom.json                   |  2 +-
 include/qom/object.h            | 25 ++++++++++++++++
 include/qom/object_interfaces.h | 51 +++++++++++++++++++++++++++++++++
 chardev/char.c                  |  4 +--
 hw/core/cpu-common.c            | 13 ++++++---
 hw/core/qdev.c                  | 20 +++++++++++--
 hw/i386/x86-iommu.c             | 26 +++++++++++++++--
 migration/migration.c           | 36 +++++++++++++++++++++--
 qom/object.c                    | 50 ++++++++++++++++++++++++++++++--
 qom/object_interfaces.c         | 33 +++++++++++++++++++--
 qom/qom-qmp-cmds.c              |  4 +--
 system/qdev-monitor.c           |  4 +--
 13 files changed, 243 insertions(+), 27 deletions(-)

Daniel P. Berrangé Oct. 30, 2024, 9:48 a.m. UTC | #1

On Tue, Oct 29, 2024 at 05:16:00PM -0400, Peter Xu wrote:
> v1: https://lore.kernel.org/r/20241024165627.1372621-1-peterx@redhat.com

> Meanwhile, migration has a long standing issue on current_migration
> pointer, where it can point to freed data after the migration object is
> finalized.  It is debatable that the pointer can be cleared after the main
> thread (1) join() the migration thread first, then (2) release the last
> refcount for the migration object and clear the pointer.  However there's
> still major challenges [1].  With singleton, we could have a slightly but
> hopefully working workaround to clear the pointer during finalize().

I'm still not entirely convinced that this singleton proposal is
fixing the migration problem correctly.

Based on discussions in v1, IIUC, the situation is that we have
migration_shutdown() being called from qemu_cleanup(). The former
will call object_unref(current_migration), but there may still
be background migration threads running that access 'current_migration',
and thus a potential use-after-free.

Based on what the 7th patch here does, the key difference is that
the finalize() method for MigrationState will set 'current_migration'
to NULL after free'ing it.

I don't believe that is safe.

Back to the current code, if there is a use-after-free today, that
implies that the background threads are *not* holding their own
reference on 'current_migration', allowing the object to be free'd
while they're still using it. If they held their own reference then
the object_unref in migration_shutdown would not have any use after
free risk.

The new code is not changing the ref counting done by any threads.
Therefore if there's a use-after-free in existing code, AFAICT, the
same use-after-free *must* still exist in the current code.

The 7th patch only fixes the use-after-free, *if and only if* the
background thread tries to access 'current_migration', /after/
finalize as completed. The use-after-free in this case, has been
turned into a NULL pointer reference.

A background thread could be accessing the 'current_migration' pointer
*concurrently* with the finalize method executing though. In this case
we still have a use after free problem, only the time window in which
it exists has been narrowed a little.

Shouldn't the problem with migration be solved by every migration thread
holding a reference on current_migration, that the thread releases when
it exits, such that MigrationState is only finalized once every thread
has exited ? That would not require any join() synchronization point.

With regards,
Daniel

Peter Xu Oct. 30, 2024, 1:13 p.m. UTC | #2

On Wed, Oct 30, 2024 at 09:48:07AM +0000, Daniel P. Berrangé wrote:
> On Tue, Oct 29, 2024 at 05:16:00PM -0400, Peter Xu wrote:
> > v1: https://lore.kernel.org/r/20241024165627.1372621-1-peterx@redhat.com
> 
> > Meanwhile, migration has a long standing issue on current_migration
> > pointer, where it can point to freed data after the migration object is
> > finalized.  It is debatable that the pointer can be cleared after the main
> > thread (1) join() the migration thread first, then (2) release the last
> > refcount for the migration object and clear the pointer.  However there's
> > still major challenges [1].  With singleton, we could have a slightly but
> > hopefully working workaround to clear the pointer during finalize().
> 
> I'm still not entirely convinced that this singleton proposal is
> fixing the migration problem correctly.
> 
> Based on discussions in v1, IIUC, the situation is that we have
> migration_shutdown() being called from qemu_cleanup(). The former
> will call object_unref(current_migration), but there may still
> be background migration threads running that access 'current_migration',
> and thus a potential use-after-free.

migration thread is fine, it takes a refcount at the entry.

And btw, taking it at the entry is racy, we've just fixed it, see (in my
next migration pull):

https://lore.kernel.org/qemu-devel/20241024213056.1395400-2-peterx@redhat.com/

The access reported was, IIUC, outside migration code, but after both
main/migration threads released the refcount, hence after finalize().  It
could be a random migration_is_running() call very late in device code, for
example.

> 
> Based on what the 7th patch here does, the key difference is that
> the finalize() method for MigrationState will set 'current_migration'
> to NULL after free'ing it.

Yes.  But this show case series isn't complete.  We need a migration-side
lock finally to make it safe to access.  For that, see:

https://lore.kernel.org/qemu-devel/20241024213056.1395400-9-peterx@redhat.com/

> 
> I don't believe that is safe.

I hope after the other series applied it will be 100% safe, even though I
agree it's tricky.  But hopefully QOM is very clean, the trickly part is
still within migration, and it should be less tricky than migration
implement a refcount on top of Object..

> 
> Back to the current code, if there is a use-after-free today, that
> implies that the background threads are *not* holding their own
> reference on 'current_migration', allowing the object to be free'd
> while they're still using it. If they held their own reference then
> the object_unref in migration_shutdown would not have any use after
> free risk.
> 
> The new code is not changing the ref counting done by any threads.
> Therefore if there's a use-after-free in existing code, AFAICT, the
> same use-after-free *must* still exist in the current code.
> 
> The 7th patch only fixes the use-after-free, *if and only if* the
> background thread tries to access 'current_migration', /after/
> finalize as completed. The use-after-free in this case, has been
> turned into a NULL pointer reference.
> 
> A background thread could be accessing the 'current_migration' pointer
> *concurrently* with the finalize method executing though. In this case
> we still have a use after free problem, only the time window in which
> it exists has been narrowed a little.
> 
> Shouldn't the problem with migration be solved by every migration thread
> holding a reference on current_migration, that the thread releases when
> it exits, such that MigrationState is only finalized once every thread
> has exited ? That would not require any join() synchronization point.

I think the question is whether things like migration_is_running() is
allowed to be used anywhere, even after migration_shutdown().  My answer
is, it should be ok to be used anywhere, and we don't necessarilly need to
limit that.  In that case the caller doesn't need to take a refcount
because it's an immediate query.  It can simply check its existance with
the lock (after my patch 8 of the other series applied, which depends on
this qom series).

Thanks,

Daniel P. Berrangé Oct. 30, 2024, 4:13 p.m. UTC | #3

On Wed, Oct 30, 2024 at 09:13:13AM -0400, Peter Xu wrote:
> On Wed, Oct 30, 2024 at 09:48:07AM +0000, Daniel P. Berrangé wrote:
> > On Tue, Oct 29, 2024 at 05:16:00PM -0400, Peter Xu wrote:
> > > v1: https://lore.kernel.org/r/20241024165627.1372621-1-peterx@redhat.com
> > 
> > > Meanwhile, migration has a long standing issue on current_migration
> > > pointer, where it can point to freed data after the migration object is
> > > finalized.  It is debatable that the pointer can be cleared after the main
> > > thread (1) join() the migration thread first, then (2) release the last
> > > refcount for the migration object and clear the pointer.  However there's
> > > still major challenges [1].  With singleton, we could have a slightly but
> > > hopefully working workaround to clear the pointer during finalize().
> > 
> > I'm still not entirely convinced that this singleton proposal is
> > fixing the migration problem correctly.
> > 
> > Based on discussions in v1, IIUC, the situation is that we have
> > migration_shutdown() being called from qemu_cleanup(). The former
> > will call object_unref(current_migration), but there may still
> > be background migration threads running that access 'current_migration',
> > and thus a potential use-after-free.
> 
> migration thread is fine, it takes a refcount at the entry.
> 
> And btw, taking it at the entry is racy, we've just fixed it, see (in my
> next migration pull):
> 
> https://lore.kernel.org/qemu-devel/20241024213056.1395400-2-peterx@redhat.com/

Yep, acquiring the refcount immediately before thread-create
is what I meant.

> The access reported was, IIUC, outside migration code, but after both
> main/migration threads released the refcount, hence after finalize().  It
> could be a random migration_is_running() call very late in device code, for
> example.



> 
> > 
> > Based on what the 7th patch here does, the key difference is that
> > the finalize() method for MigrationState will set 'current_migration'
> > to NULL after free'ing it.
> 
> Yes.  But this show case series isn't complete.  We need a migration-side
> lock finally to make it safe to access.  For that, see:
> 
> https://lore.kernel.org/qemu-devel/20241024213056.1395400-9-peterx@redhat.com/
> 
> > 
> > I don't believe that is safe.
> 
> I hope after the other series applied it will be 100% safe, even though I
> agree it's tricky.  But hopefully QOM is very clean, the trickly part is
> still within migration, and it should be less tricky than migration
> implement a refcount on top of Object..

Ok, so with the other series applied, this does look safe, but
it also doesn't seem to really have any dependancy on the
single interface code.  Patch 7 here looks sufficient, in combo
with the other 2 series to avoid the use-after-free flaws.

> I think the question is whether things like migration_is_running() is
> allowed to be used anywhere, even after migration_shutdown().  My answer
> is, it should be ok to be used anywhere, and we don't necessarilly need to
> limit that.  In that case the caller doesn't need to take a refcount
> because it's an immediate query.  It can simply check its existance with
> the lock (after my patch 8 of the other series applied, which depends on
> this qom series).

Agree, and from a practical POV, I think it would be impossible to
require a ref count be held from other non-migration threads, so the
locking around 'current_migration' looks like the only practical
option.

With regards,
Daniel

Peter Xu Oct. 30, 2024, 5:51 p.m. UTC | #4

On Wed, Oct 30, 2024 at 04:13:57PM +0000, Daniel P. Berrangé wrote:
> On Wed, Oct 30, 2024 at 09:13:13AM -0400, Peter Xu wrote:
> > On Wed, Oct 30, 2024 at 09:48:07AM +0000, Daniel P. Berrangé wrote:
> > > On Tue, Oct 29, 2024 at 05:16:00PM -0400, Peter Xu wrote:
> > > > v1: https://lore.kernel.org/r/20241024165627.1372621-1-peterx@redhat.com
> > > 
> > > > Meanwhile, migration has a long standing issue on current_migration
> > > > pointer, where it can point to freed data after the migration object is
> > > > finalized.  It is debatable that the pointer can be cleared after the main
> > > > thread (1) join() the migration thread first, then (2) release the last
> > > > refcount for the migration object and clear the pointer.  However there's
> > > > still major challenges [1].  With singleton, we could have a slightly but
> > > > hopefully working workaround to clear the pointer during finalize().
> > > 
> > > I'm still not entirely convinced that this singleton proposal is
> > > fixing the migration problem correctly.
> > > 
> > > Based on discussions in v1, IIUC, the situation is that we have
> > > migration_shutdown() being called from qemu_cleanup(). The former
> > > will call object_unref(current_migration), but there may still
> > > be background migration threads running that access 'current_migration',
> > > and thus a potential use-after-free.
> > 
> > migration thread is fine, it takes a refcount at the entry.
> > 
> > And btw, taking it at the entry is racy, we've just fixed it, see (in my
> > next migration pull):
> > 
> > https://lore.kernel.org/qemu-devel/20241024213056.1395400-2-peterx@redhat.com/
> 
> Yep, acquiring the refcount immediately before thread-create
> is what I meant.
> 
> > The access reported was, IIUC, outside migration code, but after both
> > main/migration threads released the refcount, hence after finalize().  It
> > could be a random migration_is_running() call very late in device code, for
> > example.
> 
> 
> 
> > 
> > > 
> > > Based on what the 7th patch here does, the key difference is that
> > > the finalize() method for MigrationState will set 'current_migration'
> > > to NULL after free'ing it.
> > 
> > Yes.  But this show case series isn't complete.  We need a migration-side
> > lock finally to make it safe to access.  For that, see:
> > 
> > https://lore.kernel.org/qemu-devel/20241024213056.1395400-9-peterx@redhat.com/
> > 
> > > 
> > > I don't believe that is safe.
> > 
> > I hope after the other series applied it will be 100% safe, even though I
> > agree it's tricky.  But hopefully QOM is very clean, the trickly part is
> > still within migration, and it should be less tricky than migration
> > implement a refcount on top of Object..
> 
> Ok, so with the other series applied, this does look safe, but
> it also doesn't seem to really have any dependancy on the
> single interface code.  Patch 7 here looks sufficient, in combo
> with the other 2 series to avoid the use-after-free flaws.

Patch 7, when applied without patch 6 and prior, will crash in
device-introspect-test, trying to create yet another migration object when
processing the "device-list-properties" QMP command.  And it turns out
that's also not the only way QEMU can crash by that.

Fundamentally it's because patch 7 has global operations within
init()/finalize() to fix the migration dangling pointer, hence it must not
be instanciated more than once.

It's also probably because I always think singleton can be useful in
general to QEMU's device model where can be special devices all over the
places that I'm not aware of.  I didn't work on a lot of QEMU devices, but
with that limited experience I still stumbled upon two devices (if taking
migration object as one..) that might benefit from it.

That leads to this whole series, which is also the cleanest so far I can
think of to solve the immediate migration UAF.

Thanks,

> 
> > I think the question is whether things like migration_is_running() is
> > allowed to be used anywhere, even after migration_shutdown().  My answer
> > is, it should be ok to be used anywhere, and we don't necessarilly need to
> > limit that.  In that case the caller doesn't need to take a refcount
> > because it's an immediate query.  It can simply check its existance with
> > the lock (after my patch 8 of the other series applied, which depends on
> > this qom series).
> 
> Agree, and from a practical POV, I think it would be impossible to
> require a ref count be held from other non-migration threads, so the
> locking around 'current_migration' looks like the only practical
> option.
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Daniel P. Berrangé Oct. 30, 2024, 5:58 p.m. UTC | #5

On Wed, Oct 30, 2024 at 01:51:54PM -0400, Peter Xu wrote:
> On Wed, Oct 30, 2024 at 04:13:57PM +0000, Daniel P. Berrangé wrote:
> > On Wed, Oct 30, 2024 at 09:13:13AM -0400, Peter Xu wrote:
> > > On Wed, Oct 30, 2024 at 09:48:07AM +0000, Daniel P. Berrangé wrote:
> > > > On Tue, Oct 29, 2024 at 05:16:00PM -0400, Peter Xu wrote:
> > > > > v1: https://lore.kernel.org/r/20241024165627.1372621-1-peterx@redhat.com
> > > > 
> > > > > Meanwhile, migration has a long standing issue on current_migration
> > > > > pointer, where it can point to freed data after the migration object is
> > > > > finalized.  It is debatable that the pointer can be cleared after the main
> > > > > thread (1) join() the migration thread first, then (2) release the last
> > > > > refcount for the migration object and clear the pointer.  However there's
> > > > > still major challenges [1].  With singleton, we could have a slightly but
> > > > > hopefully working workaround to clear the pointer during finalize().
> > > > 
> > > > I'm still not entirely convinced that this singleton proposal is
> > > > fixing the migration problem correctly.
> > > > 
> > > > Based on discussions in v1, IIUC, the situation is that we have
> > > > migration_shutdown() being called from qemu_cleanup(). The former
> > > > will call object_unref(current_migration), but there may still
> > > > be background migration threads running that access 'current_migration',
> > > > and thus a potential use-after-free.
> > > 
> > > migration thread is fine, it takes a refcount at the entry.
> > > 
> > > And btw, taking it at the entry is racy, we've just fixed it, see (in my
> > > next migration pull):
> > > 
> > > https://lore.kernel.org/qemu-devel/20241024213056.1395400-2-peterx@redhat.com/
> > 
> > Yep, acquiring the refcount immediately before thread-create
> > is what I meant.
> > 
> > > The access reported was, IIUC, outside migration code, but after both
> > > main/migration threads released the refcount, hence after finalize().  It
> > > could be a random migration_is_running() call very late in device code, for
> > > example.
> > 
> > 
> > 
> > > 
> > > > 
> > > > Based on what the 7th patch here does, the key difference is that
> > > > the finalize() method for MigrationState will set 'current_migration'
> > > > to NULL after free'ing it.
> > > 
> > > Yes.  But this show case series isn't complete.  We need a migration-side
> > > lock finally to make it safe to access.  For that, see:
> > > 
> > > https://lore.kernel.org/qemu-devel/20241024213056.1395400-9-peterx@redhat.com/
> > > 
> > > > 
> > > > I don't believe that is safe.
> > > 
> > > I hope after the other series applied it will be 100% safe, even though I
> > > agree it's tricky.  But hopefully QOM is very clean, the trickly part is
> > > still within migration, and it should be less tricky than migration
> > > implement a refcount on top of Object..
> > 
> > Ok, so with the other series applied, this does look safe, but
> > it also doesn't seem to really have any dependancy on the
> > single interface code.  Patch 7 here looks sufficient, in combo
> > with the other 2 series to avoid the use-after-free flaws.
> 
> Patch 7, when applied without patch 6 and prior, will crash in
> device-introspect-test, trying to create yet another migration object when
> processing the "device-list-properties" QMP command.  And it turns out
> that's also not the only way QEMU can crash by that.
> 
> Fundamentally it's because patch 7 has global operations within
> init()/finalize() to fix the migration dangling pointer, hence it must not
> be instanciated more than once.

That's a result from moving the "assert()" into the constructor.
The assert(!current_migration) can be kept in migration_object_init,
the constructor could conditionally set current_migration only if it
is NULL, and the finalizer could conditionally clear current_migration
only if it matches the current object. There's no conceptual dependancy
on having a singleton interface in the patch.

With regards,
Daniel

Daniel P. Berrangé Oct. 30, 2024, 6:07 p.m. UTC | #6

On Tue, Oct 29, 2024 at 05:16:00PM -0400, Peter Xu wrote:
> v1: https://lore.kernel.org/r/20241024165627.1372621-1-peterx@redhat.com
> 
> This patchset introduces the singleton interface for QOM.  I didn't add a
> changelog because there're quite a few changes here and there, plus new
> patches.  So it might just be easier to re-read, considering the patchset
> isn't large.
> 
> I switched v2 into RFC, because we have reviewer concerns (Phil and Dan so
> far) that it could be error prone to try to trap every attempts to create
> an object.  My argument is, if we already have abstract class, meanwhile we
> do not allow instantiation of abstract class, so the complexity is already
> there.  I prepared patch 1 this time to collect and track all similar
> random object creations; it might be helpful as a cleanup on its own to
> deduplicate some similar error messages.  Said that, I'm still always open
> to rejections to this proposal.
> 
> I hope v2 looks slightly cleaner by having not only object_new_allowed()
> but also object_new_or_fetch().

For me, that doesn't really make it much more appealing. Yes, we already have
an abstract class, but that has narrower impact, as there are fewer places
in which which we can trigger instantiation of an abstract class, than
where we can trigger instantiation of arbitrary objects and devices.

The conversion of the iommu code results in worse error reporting, and
doesn't handle the virtio-iommu case, and the migration problems appear
solvable without inventing a singleton interface. So this doesn't feel
like it is worth the the trouble.

NB, my view point would have been different if  'object_new' had an
"Error *errp" parameter. That would have made handling failure a
standard part of the design pattern for object construction, thus
avoiding adding asserts in the 'object_new' codepath which could be
triggered by unexpected/badly validated user input.

With regards,
Daniel

Peter Xu Oct. 30, 2024, 6:55 p.m. UTC | #7

On Wed, Oct 30, 2024 at 05:58:15PM +0000, Daniel P. Berrangé wrote:
> That's a result from moving the "assert()" into the constructor.
> The assert(!current_migration) can be kept in migration_object_init,
> the constructor could conditionally set current_migration only if it
> is NULL, and the finalizer could conditionally clear current_migration
> only if it matches the current object. There's no conceptual dependancy
> on having a singleton interface in the patch.

Yes that could work, but that sounds more hackish from my POV, trying to
detect "which is the real migration object".  We need to assume (1) the
1st migration object being created must be the global migration object,
then (2) we still allow concurrent creations of such object, which I
personally don't really like, even with a hack already.

If this series cannot get accepted, I can try go with that, or I'll
implement the refcount in migration.c, whichever I found better at last.

To me, both are less clean comparing to singleton.

Peter Xu Oct. 30, 2024, 7:08 p.m. UTC | #8

On Wed, Oct 30, 2024 at 06:07:08PM +0000, Daniel P. Berrangé wrote:
> On Tue, Oct 29, 2024 at 05:16:00PM -0400, Peter Xu wrote:
> > v1: https://lore.kernel.org/r/20241024165627.1372621-1-peterx@redhat.com
> > 
> > This patchset introduces the singleton interface for QOM.  I didn't add a
> > changelog because there're quite a few changes here and there, plus new
> > patches.  So it might just be easier to re-read, considering the patchset
> > isn't large.
> > 
> > I switched v2 into RFC, because we have reviewer concerns (Phil and Dan so
> > far) that it could be error prone to try to trap every attempts to create
> > an object.  My argument is, if we already have abstract class, meanwhile we
> > do not allow instantiation of abstract class, so the complexity is already
> > there.  I prepared patch 1 this time to collect and track all similar
> > random object creations; it might be helpful as a cleanup on its own to
> > deduplicate some similar error messages.  Said that, I'm still always open
> > to rejections to this proposal.
> > 
> > I hope v2 looks slightly cleaner by having not only object_new_allowed()
> > but also object_new_or_fetch().
> 
> For me, that doesn't really make it much more appealing. Yes, we already have
> an abstract class, but that has narrower impact, as there are fewer places
> in which which we can trigger instantiation of an abstract class, than
> where we can trigger instantiation of arbitrary objects and devices.

There should be exactly the same number of places that will need care for
either abstract or singleton.  I tried to justify this with patch 1.

I still think patch 1 can be seen as a cleanup too on its own (dedups the
same "The class is abstract" error message), tracking random object
creations so logically we could have the idea on whether a class can be
instantiated at all, starting with abstract class.

The real extra "complexity" is object_new_or_fetch(), but I hope it's not a
major concern either.  We only have two such use (aka, "please give me an
object of class XXX"), which is qom/device-list-properties.  I don't expect
it to be common, I hope it's easy to maintain.

> 
> The conversion of the iommu code results in worse error reporting, and
> doesn't handle the virtio-iommu case, and the migration problems appear
> solvable without inventing a singleton interface. So this doesn't feel
> like it is worth the the trouble.

IMHO that's not a major issue, I can drop patch 3-5 just to make it simple
as of now.  Btw, I have a TODO in patch 2 where I mentioned we can provide
better error report if we want, so we can easily have exactly the same
error as before with maybe a few or 10+ LOCs on top.  It's trivial.

object_new_allowed():

+    if (object_class_is_singleton(klass)) {
+        Object *obj = singleton_get_instance(klass);
+
+        if (obj) {
+            object_unref(obj);
+            /*
+             * TODO: Enhance the error message.  E.g., the singleton class
+             * can provide a per-class error message in SingletonClass.
+             */
+            error_setg(errp, "Object type '%s' conflicts with "
+                       "an existing singleton instance",
+                       klass->type->name);
+            return false;
+        }
+    }

> 
> NB, my view point would have been different if  'object_new' had an
> "Error *errp" parameter. That would have made handling failure a
> standard part of the design pattern for object construction, thus
> avoiding adding asserts in the 'object_new' codepath which could be
> triggered by unexpected/badly validated user input.

Yes I also wished object_new() can take an Error** when I started working
on it.  It would make this much easier, indeed.  I suppose we don't need
that by not allowing instance_init() to fail at all, postponing things to
realize().  I suppose that's a "tactic" QEMU chose explicitly to make it
easy that object_new() callers keep like before with zero error handling
needed.  At least for TYPE_DEVICE it looks all fine if all such operations
can be offloaded into realize().  I'm not sure user creatable has those
steps also because of this limitation.

I was trying to do that with object_new_allowed() here instead, whenever it
could be triggered by an user input.  We could have an extra layer before
reaching object_new() to guard any user input, and I think
object_new_allowed() could play that role.  When / If we want to introduce
Error** to object_new() some day (or a variance of it), we could simply
move object_new_allowed() into it.

Thanks,

Daniel P. Berrangé Oct. 31, 2024, 3:57 p.m. UTC | #9

On Wed, Oct 30, 2024 at 03:08:08PM -0400, Peter Xu wrote:
> On Wed, Oct 30, 2024 at 06:07:08PM +0000, Daniel P. Berrangé wrote:
> > On Tue, Oct 29, 2024 at 05:16:00PM -0400, Peter Xu wrote:
> > > v1: https://lore.kernel.org/r/20241024165627.1372621-1-peterx@redhat.com
> > > 
> > > This patchset introduces the singleton interface for QOM.  I didn't add a
> > > changelog because there're quite a few changes here and there, plus new
> > > patches.  So it might just be easier to re-read, considering the patchset
> > > isn't large.
> > > 
> > > I switched v2 into RFC, because we have reviewer concerns (Phil and Dan so
> > > far) that it could be error prone to try to trap every attempts to create
> > > an object.  My argument is, if we already have abstract class, meanwhile we
> > > do not allow instantiation of abstract class, so the complexity is already
> > > there.  I prepared patch 1 this time to collect and track all similar
> > > random object creations; it might be helpful as a cleanup on its own to
> > > deduplicate some similar error messages.  Said that, I'm still always open
> > > to rejections to this proposal.
> > > 
> > > I hope v2 looks slightly cleaner by having not only object_new_allowed()
> > > but also object_new_or_fetch().
> > 
> > For me, that doesn't really make it much more appealing. Yes, we already have
> > an abstract class, but that has narrower impact, as there are fewer places
> > in which which we can trigger instantiation of an abstract class, than
> > where we can trigger instantiation of arbitrary objects and devices.
> 
> There should be exactly the same number of places that will need care for
> either abstract or singleton.  I tried to justify this with patch 1.
> 
> I still think patch 1 can be seen as a cleanup too on its own (dedups the
> same "The class is abstract" error message), tracking random object
> creations so logically we could have the idea on whether a class can be
> instantiated at all, starting with abstract class.

I think patch 1 might be incomplete, as I'm not seeing what checks
for abstract or singleton classes in the 'qdev_new' code paths, used
by -device / device_add QMP. This is an example of the risks of adding
more failure scenarios to object_add.

> > NB, my view point would have been different if  'object_new' had an
> > "Error *errp" parameter. That would have made handling failure a
> > standard part of the design pattern for object construction, thus
> > avoiding adding asserts in the 'object_new' codepath which could be
> > triggered by unexpected/badly validated user input.
> 
> Yes I also wished object_new() can take an Error** when I started working
> on it.  It would make this much easier, indeed.  I suppose we don't need
> that by not allowing instance_init() to fail at all, postponing things to
> realize().  I suppose that's a "tactic" QEMU chose explicitly to make it
> easy that object_new() callers keep like before with zero error handling
> needed.  At least for TYPE_DEVICE it looks all fine if all such operations
> can be offloaded into realize().  I'm not sure user creatable has those
> steps also because of this limitation.
> 
> I was trying to do that with object_new_allowed() here instead, whenever it
> could be triggered by an user input.  We could have an extra layer before
> reaching object_new() to guard any user input, and I think
> object_new_allowed() could play that role.  When / If we want to introduce
> Error** to object_new() some day (or a variance of it), we could simply
> move object_new_allowed() into it.

Yes, having thought about this today, I came up with a way that we could
introduce a object_new_dynamic() variant with "Error *errp" instead of
asserts, and *crucially* force its use in the unsafe scenarios. ie any
place that is not passing a const,static string.  I've CC'd you on an
RFC series that mocks up this idea. That would be sufficient to remove
my objections wrt the singleton concept.


With regards,
Daniel

[RFC,v2,0/7] QOM: Singleton interface

Message

Comments