Message ID | X+DbupqYE3rrFaIM@mattapan.m5p.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RESEND,RFC] xen/arm: domain_build: Ignore empty memory bank | expand |
Hi Elliott, I was planning to review the first version today, but as you sent a new version I will answer on this one directly. On 21/12/2020 17:30, Elliott Mitchell wrote: > Previously Xen had stopped processing Device Trees if an empty > (size == 0) memory bank was found. > > Commit 5a37207df52066efefe419c677b089a654d37afc changed this behavior to > ignore such banks. Unfortunately this means these empty nodes are > visible to code which accesses the device trees. Have domain_build also > ignore these entries. I am probably missing something here. The commit you pointed out will only take care of memory nodes (including reserved-memory). It should not be possible to reach handle_device() with actual RAM. Although, it would with the reserved memory node. Could you provide a bit more details on the issue? In particular, I am interested to see the offending node and its content. > --- > This is tagged "RFC" due to issues. > > Authorship of this is unclean. In the first version (checked in, but > never sent to the list and never compiled) a different condition was used > and the comment was absent. When examing the code it became clear a > condition identical to > 5a37207df52066efefe419c677b089a654d37afc was appropriate and so I changed > to !size. Since what the code is doing was sufficiently similar, the > comment was grabbed. > How far does this dilute authorship? I diagnosed the bug and figured out > where to add the lines, but the amount inspired by Julien Grall gives > Julien Grall some level of claim of authorship. Advice is needed. You did all the investigation of the bug and the code is small enough. So I think it is fine for you to claim the authorship. > > Commit 7d2b21fd36c2a47799eed71c67bae7faa1ec4272 is an outright bug for > me. I don't know what percentage of users will experience this bug, but > being observed this quickly suggests this is major enough to be urgent > for the stable-4.14 branch. We usually work on fix for upstream first and then backport it. > I doubt this is the only bug exposed by > 5a37207df52066efefe419c677b089a654d37afc. Are you saying that with my patch dropped, Xen will boot but with it will not? > This might actually effect > most uses of the device-tree code. I think either the core needs to be > fixed to hide zero-sized entries from anything outside of > xen/common/device_tree.c, otherwise all uses of the device-tree core need > to be audited to ensure they ignore zero-sized entries. The meaning of zero-sized is not the same everywhere. In the case of memory banks, it can be safely ignored. For other devices (e.g. GIC), hiding it may make things worse because a size 0 means the node is bogus. > Notably this is > the second location where zero-size device-tree entries need to be > ignored, preemptive action should be taken before a third is found by > bugreport. > > Perhaps this fix is appropriate for the stable-4.14 branch and a proper > solution should be implemented for the main branch? > > The error message which first showed was > "Unable to retrieve address %u for %s\n". Where the number in %u was > 0, this seems a poor error message. Version 0.1 (which never got > compiled) had been: if(!addr) continue; Usually, dt_get_address() will fail because it can't translate the address, not because the size is 0. The more... > > As I thought the 0 it was reporting was an address of 0. Perhaps the > message should instead be: > "Unable to retrieve address for index %u of %s\n"? > --- > xen/arch/arm/domain_build.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c > index e824ba34b0..0b83384bd3 100644 > --- a/xen/arch/arm/domain_build.c > +++ b/xen/arch/arm/domain_build.c > @@ -1405,6 +1405,11 @@ static int __init handle_device(struct domain *d, struct dt_device_node *dev, > { > struct map_range_data mr_data = { .d = d, .p2mt = p2mt }; > res = dt_device_get_address(dev, i, &addr, &size); > + > + /* Some DT may describe empty bank, ignore them */ > + if ( !size ) > + continue; ... dt_device_get_address() will not set the size if the node is bogus. So you can't rely on either addr or size when res is non-zero. handle_device() (at least on unstable) will not initialize the two variables to 0. So I guess you are lucky that you compiler zeroed them for you, but that's not the normal behavior. So I think we first need to figure out what is the offending node and why it is dt_device_get_address() is returning an error for it. That said, I agree that we possibly want a check size == 0 (action TBD) in map_range_to_domain() as the code would do the wrong thing for 0. > if ( res ) > { > printk(XENLOG_ERR "Unable to retrieve address %u for %s\n", > Cheers,
On Mon, Dec 21, 2020 at 06:28:35PM +0000, Julien Grall wrote: > I was planning to review the first version today, but as you sent a new > version I will answer on this one directly. Mostly the commentary has been increasing, not so much the commit. > On 21/12/2020 17:30, Elliott Mitchell wrote: > > Previously Xen had stopped processing Device Trees if an empty > > (size == 0) memory bank was found. > > > > Commit 5a37207df52066efefe419c677b089a654d37afc changed this behavior to > > ignore such banks. Unfortunately this means these empty nodes are > > visible to code which accesses the device trees. Have domain_build also > > ignore these entries. > > I am probably missing something here. The commit you pointed out will > only take care of memory nodes (including reserved-memory). > > It should not be possible to reach handle_device() with actual RAM. > Although, it would with the reserved memory node. > > Could you provide a bit more details on the issue? In particular, I am > interested to see the offending node and its content. Define "see" in this context. The message which shows up is: "Unable to retrieve address 0 for /scb/pcie@7d500000/pci@1,0/usb@1,0" According to Linux "name", "reg" and "resets" exist there. (which as stated "0" looks suspiciously like NULL, rather than an index) > > I doubt this is the only bug exposed by > > 5a37207df52066efefe419c677b089a654d37afc. > > Are you saying that with my patch dropped, Xen will boot but with it > will not? Hmm, realizing that I'd found a fix, but not actually tested to confirm... Seems I had it wrong, dropping 5a37207df52066efefe419c677b089a654d37afc doesn't fix the issue, so that is NOT the cause. Sorry about misattributing the cause. Now to start doing builds to identify the cause... (this of course will take a bit) Hmm, this is being rather more difficult to identify than I expected. > > As I thought the 0 it was reporting was an address of 0. Perhaps the > > message should instead be: > > "Unable to retrieve address for index %u of %s\n"? > > --- > > xen/arch/arm/domain_build.c | 5 +++++ > > 1 file changed, 5 insertions(+) > > > > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c > > index e824ba34b0..0b83384bd3 100644 > > --- a/xen/arch/arm/domain_build.c > > +++ b/xen/arch/arm/domain_build.c > > @@ -1405,6 +1405,11 @@ static int __init handle_device(struct domain *d, struct dt_device_node *dev, > > { > > struct map_range_data mr_data = { .d = d, .p2mt = p2mt }; > > res = dt_device_get_address(dev, i, &addr, &size); > > + > > + /* Some DT may describe empty bank, ignore them */ > > + if ( !size ) > > + continue; > > ... dt_device_get_address() will not set the size if the node is bogus. > So you can't rely on either addr or size when res is non-zero. > > handle_device() (at least on unstable) will not initialize the two > variables to 0. So I guess you are lucky that you compiler zeroed them > for you, but that's not the normal behavior. > > So I think we first need to figure out what is the offending node and > why it is dt_device_get_address() is returning an error for it. > > That said, I agree that we possibly want a check size == 0 (action TBD) > in map_range_to_domain() as the code would do the wrong thing for 0. Seems like returning to a working build without that commit is going to take a bit. :-(
On Mon, Dec 21, 2020 at 06:28:35PM +0000, Julien Grall wrote: > On 21/12/2020 17:30, Elliott Mitchell wrote: > > I doubt this is the only bug exposed by > > 5a37207df52066efefe419c677b089a654d37afc. > > Are you saying that with my patch dropped, Xen will boot but with it > will not? I thought that was the cause. Yet after a bunch of builds trying to ensure I can cause it to reproduce or not, I wasn't able to. As such I now think this is a misattribution. :-( Other candidate on my radar is this showed up near the time I started trying other kernel sources. I now wonder if this is due to the device-trees being produced with recent RPF kernels versus those being produced with pure mainline. Presently I'm using a 5.10 RPF kernel and device-tree. > So I think we first need to figure out what is the offending node and > why it is dt_device_get_address() is returning an error for it. > > That said, I agree that we possibly want a check size == 0 (action TBD) > in map_range_to_domain() as the code would do the wrong thing for 0. Already stated "/scb/pcie@7d500000/pci@1,0/usb@1,0". Perhaps the code should be ignoring nodes for which which dt_device_get_address() fails, or perhaps this should only be done for Domain 0 (where it results in panic).
On Mon, 4 Jan 2021, Elliott Mitchell wrote: > On Mon, Dec 21, 2020 at 06:28:35PM +0000, Julien Grall wrote: > > On 21/12/2020 17:30, Elliott Mitchell wrote: > > > I doubt this is the only bug exposed by > > > 5a37207df52066efefe419c677b089a654d37afc. > > > > Are you saying that with my patch dropped, Xen will boot but with it > > will not? > > I thought that was the cause. Yet after a bunch of builds trying to > ensure I can cause it to reproduce or not, I wasn't able to. As such I > now think this is a misattribution. :-( > > Other candidate on my radar is this showed up near the time I started > trying other kernel sources. I now wonder if this is due to the > device-trees being produced with recent RPF kernels versus those being > produced with pure mainline. Presently I'm using a 5.10 RPF kernel and > device-tree. > > > > So I think we first need to figure out what is the offending node and > > why it is dt_device_get_address() is returning an error for it. > > > > That said, I agree that we possibly want a check size == 0 (action TBD) > > in map_range_to_domain() as the code would do the wrong thing for 0. > > Already stated "/scb/pcie@7d500000/pci@1,0/usb@1,0". Can you please post the full node for usb@1,0? I would like to check the corresponding device tree binding to see the expected format. > Perhaps the code should be ignoring nodes for which > which dt_device_get_address() fails, or perhaps this should only be done > for Domain 0 (where it results in panic). That seems reasonable
On Mon, Jan 11, 2021 at 04:49:57PM -0800, Stefano Stabellini wrote: > On Mon, 4 Jan 2021, Elliott Mitchell wrote: > > On Mon, Dec 21, 2020 at 06:28:35PM +0000, Julien Grall wrote: > > > So I think we first need to figure out what is the offending node and > > > why it is dt_device_get_address() is returning an error for it. > > > > > > That said, I agree that we possibly want a check size == 0 (action TBD) > > > in map_range_to_domain() as the code would do the wrong thing for 0. > > > > Already stated "/scb/pcie@7d500000/pci@1,0/usb@1,0". > > Can you please post the full node for usb@1,0? I would like to check the > corresponding device tree binding to see the expected format. > > > > Perhaps the code should be ignoring nodes for which > > which dt_device_get_address() fails, or perhaps this should only be done > > for Domain 0 (where it results in panic). > > That seems reasonable What is the best approach to passing it along? The dtb file is 46KB, figuring out the right dts file isn't always simple. I can state it was at commit cff5fa15e5d11758db426eee3524a5dfded3c62b of https://github.com/raspberrypi/linux.git No matter one's view on GitHub, this is also viewable online: https://github.com/raspberrypi/linux/tree/cff5fa15e5d11758db426eee3524a5dfded3c62b/arch/arm/boot/dts
diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c index e824ba34b0..0b83384bd3 100644 --- a/xen/arch/arm/domain_build.c +++ b/xen/arch/arm/domain_build.c @@ -1405,6 +1405,11 @@ static int __init handle_device(struct domain *d, struct dt_device_node *dev, { struct map_range_data mr_data = { .d = d, .p2mt = p2mt }; res = dt_device_get_address(dev, i, &addr, &size); + + /* Some DT may describe empty bank, ignore them */ + if ( !size ) + continue; + if ( res ) { printk(XENLOG_ERR "Unable to retrieve address %u for %s\n",