Message ID | 1625080041-29010-3-git-send-email-eric.devolder@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | acpi: Error Record Serialization Table, ERST, support for QEMU | expand |
Oops, at the end of the 4th paragraph, I meant to state that "Linux does not support the NVRAM mode." rather than "non-NVRAM mode", which contradicts everything I stated prior. Eric.
On Wed, 30 Jun 2021 19:26:39 +0000 Eric DeVolder <eric.devolder@oracle.com> wrote: > Oops, at the end of the 4th paragraph, I meant to state that "Linux does not support the NVRAM mode." > rather than "non-NVRAM mode", which contradicts everything I stated prior. > Eric. > ________________________________ > From: Eric DeVolder <eric.devolder@oracle.com> > Sent: Wednesday, June 30, 2021 2:07 PM > To: qemu-devel@nongnu.org <qemu-devel@nongnu.org> > Cc: mst@redhat.com <mst@redhat.com>; imammedo@redhat.com <imammedo@redhat.com>; marcel.apfelbaum@gmail.com <marcel.apfelbaum@gmail.com>; pbonzini@redhat.com <pbonzini@redhat.com>; rth@twiddle.net <rth@twiddle.net>; ehabkost@redhat.com <ehabkost@redhat.com>; Konrad Wilk <konrad.wilk@oracle.com>; Boris Ostrovsky <boris.ostrovsky@oracle.com> > Subject: [PATCH v5 02/10] ACPI ERST: specification for ERST support > > Information on the implementation of the ACPI ERST support. > > Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> > --- > docs/specs/acpi_erst.txt | 152 +++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 152 insertions(+) > create mode 100644 docs/specs/acpi_erst.txt > > diff --git a/docs/specs/acpi_erst.txt b/docs/specs/acpi_erst.txt > new file mode 100644 > index 0000000..79f8eb9 > --- /dev/null > +++ b/docs/specs/acpi_erst.txt > @@ -0,0 +1,152 @@ > +ACPI ERST DEVICE > +================ > + > +The ACPI ERST device is utilized to support the ACPI Error Record > +Serialization Table, ERST, functionality. The functionality is > +designed for storing error records in persistent storage for > +future reference/debugging. > + > +The ACPI specification[1], in Chapter "ACPI Platform Error Interfaces > +(APEI)", and specifically subsection "Error Serialization", outlines > +a method for storing error records into persistent storage. > + > +The format of error records is described in the UEFI specification[2], > +in Appendix N "Common Platform Error Record". > + > +While the ACPI specification allows for an NVRAM "mode" (see > +GET_ERROR_LOG_ADDRESS_RANGE_ATTRIBUTES) where non-volatile RAM is > +directly exposed for direct access by the OS/guest, this implements > +the non-NVRAM "mode". This non-NVRAM "mode" is what is implemented > +by most BIOS (since flash memory requires programming operations > +in order to update its contents). Furthermore, as of the time of this > +writing, Linux does not support the non-NVRAM "mode". shouldn't it be s/non-NVRAM/NVRAM/ ? > + > + > +Background/Motivation > +--------------------- > +Linux uses the persistent storage filesystem, pstore, to record > +information (eg. dmesg tail) upon panics and shutdowns. Pstore is > +independent of, and runs before, kdump. In certain scenarios (ie. > +hosts/guests with root filesystems on NFS/iSCSI where networking > +software and/or hardware fails), pstore may contain the only > +information available for post-mortem debugging. well, it's not the only way, one can use existing pvpanic device to notify mgmt layer about crash and mgmt layer can take appropriate measures to for post-mortem debugging, including dumping guest state, which is superior to anything pstore can offer as VM is still exists and mgmt layer can inspect VMs crashed state directly or dump necessary parts of it. So ERST shouldn't be portrayed as the only way here but rather as limited alternative to pvpanic in regards to post-mortem debugging (it's the only way only on bare-metal). It would be better to describe here other use-cases you've mentioned in earlier reviews, that justify adding alternative to pvpanic. > +Two common storage backends for the pstore filesystem are ACPI ERST > +and UEFI. Most BIOS implement ACPI ERST. UEFI is not utilized in > +all guests. With QEMU supporting ACPI ERST, it becomes a viable > +pstore storage backend for virtual machines (as it is now for > +bare metal machines). > + > +Enabling support for ACPI ERST facilitates a consistent method to > +capture kernel panic information in a wide range of guests: from > +resource-constrained microvms to very large guests, and in > +particular, in direct-boot environments (which would lack UEFI > +run-time services). this hunk probably not necessary > + > +Note that Microsoft Windows also utilizes the ACPI ERST for certain > +crash information, if available. a pointer to a relevant source would be helpful here. > +Invocation s/^^/Configuration|Usage/ > +---------- > + > +To utilize ACPI ERST, a memory-backend-file object and acpi-erst s/utilize/use/ > +device must be created, for example: s/must/can/ > + > + qemu ... > + -object memory-backend-file,id=erstnvram,mem-path=acpi-erst.backing, > + size=0x10000,share=on I'd put ^^^ on the same line as -object and use '\' at the end the so example could be easily copy-pasted > + -device acpi-erst,memdev=erstnvram > + > +For proper operation, the ACPI ERST device needs a memory-backend-file > +object with the following parameters: > + > + - id: The id of the memory-backend-file object is used to associate > + this memory with the acpi-erst device. > + - size: The size of the ACPI ERST backing storage. This parameter is > + required. > + - mem-path: The location of the ACPI ERST backing storage file. This > + parameter is also required. > + - share: The share=on parameter is required so that updates to the > + ERST back store are written to the file immediately as well. Without > + it, updates the the backing file are unpredictable and may not > + properly persist (eg. if qemu should crash). mmap manpage says: MAP_SHARED Updates to the mapping ... are carried through to the underlying file. it doesn't guarantee 'written to the file immediately', though. So I'd rephrase it to something like that: - share: The share=on parameter is required so that updates to the ERST back store are written back to the file. > + > +The ACPI ERST device is a simple PCI device, and requires this one > +parameter: s/^.*:/and ERST device:/ > + > + - memdev: Is the object id of the memory-backend-file. > + > + > +PCI Interface > +------------- > + > +The ERST device is a PCI device with two BARs, one for accessing > +the programming registers, and the other for accessing the > +record exchange buffer. > + > +BAR0 contains the programming interface consisting of just two > +64-bit registers. The two registers are an ACTION (cmd) and a > +VALUE (data). All ERST actions/operations/side effects happen s/consisting of... All ERST/consisting of ACTION and VALUE 64-bit registers. All ERST/ > +on the write to the ACTION, by design. Thus any data needed s/Thus// > +by the action must be placed into VALUE prior to writing > +ACTION. Reading the VALUE simply returns the register contents, > +which can be updated by a previous ACTION. > This behavior is > +encoded in the ACPI ERST table generated by QEMU. it's too vague, Either drop sentence or add a reference to relevant place in spec. > + > +BAR1 contains the record exchange buffer, and the size of this > +buffer sets the maximum record size. This record exchange > +buffer size is 8KiB. s/^^^/ BAR1 contains the 8KiB record exchange buffer, which is the implemented maximum record size limit. > +Backing File s/^^^/Backing Storage Format/ > +------------ > + > +The ACPI ERST persistent storage is contained within a single backing > +file. The size and location of the backing file is specified upon > +QEMU startup of the ACPI ERST device. I'd drop above paragraph and describe file format here, ultimately used backend doesn't have to be a file. For example if user doesn't need it persist over QEMU restarts, ram backend could be used, guest will still be able to see it's own crash log after guest is reboot, or it could be memfd backend passed to QEMU by mgmt layer. > +Records are stored in the backing file in a simple fashion. s/backing file/backend storage/ ditto for other occurrences > +The backing file is essentially divided into fixed size > +"slots", ERST_RECORD_SIZE in length, with each "slot" > +storing a single record. > No attempt at optimizing storage > +through compression, compaction, etc is attempted. s/^^^// > +NOTE that any change to this value will make any pre- > +existing backing files, not of the same ERST_RECORD_SIZE, > +unusable to the guest. when that can happen, can we detect it and error out? > +Below is an example layout of the backing store file. > +The size of the file is a multiple of ERST_RECORD_SIZE, > +and contains N number of "slots" to store records. The > +example below shows two records (in CPER format) in the > +backing file, while the remaining slots are empty/ > +available. > + > + Slot Record > + +--------------------------------------------+ > + 0 | empty/available | > + +--------------------------------------------+ > + 1 | CPER | > + +--------------------------------------------+ > + 2 | CPER | > + +--------------------------------------------+ > + ... | | > + +--------------------------------------------+ > + N | empty/available | > + +--------------------------------------------+ > + <-------------- ERST_RECORD_SIZE ------------> > +Not all slots need to be occupied, and they need not be > +occupied in a contiguous fashion. The ability to clear/erase > +specific records allows for the formation of unoccupied > +slots. I'd drop this as not necessary > + > + > +References > +---------- > + > +[1] "Advanced Configuration and Power Interface Specification", > + version 4.0, June 2009. > + > +[2] "Unified Extensible Firmware Interface Specification", > + version 2.1, October 2008. > + > -- > 1.8.3.1 >
On 7/19/21 10:02 AM, Igor Mammedov wrote: > On Wed, 30 Jun 2021 19:26:39 +0000 > Eric DeVolder <eric.devolder@oracle.com> wrote: > >> Oops, at the end of the 4th paragraph, I meant to state that "Linux does not support the NVRAM mode." >> rather than "non-NVRAM mode", which contradicts everything I stated prior. >> Eric. >> ________________________________ >> From: Eric DeVolder <eric.devolder@oracle.com> >> Sent: Wednesday, June 30, 2021 2:07 PM >> To: qemu-devel@nongnu.org <qemu-devel@nongnu.org> >> Cc: mst@redhat.com <mst@redhat.com>; imammedo@redhat.com <imammedo@redhat.com>; marcel.apfelbaum@gmail.com <marcel.apfelbaum@gmail.com>; pbonzini@redhat.com <pbonzini@redhat.com>; rth@twiddle.net <rth@twiddle.net>; ehabkost@redhat.com <ehabkost@redhat.com>; Konrad Wilk <konrad.wilk@oracle.com>; Boris Ostrovsky <boris.ostrovsky@oracle.com> >> Subject: [PATCH v5 02/10] ACPI ERST: specification for ERST support >> >> Information on the implementation of the ACPI ERST support. >> >> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> >> --- >> docs/specs/acpi_erst.txt | 152 +++++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 152 insertions(+) >> create mode 100644 docs/specs/acpi_erst.txt >> >> diff --git a/docs/specs/acpi_erst.txt b/docs/specs/acpi_erst.txt >> new file mode 100644 >> index 0000000..79f8eb9 >> --- /dev/null >> +++ b/docs/specs/acpi_erst.txt >> @@ -0,0 +1,152 @@ >> +ACPI ERST DEVICE >> +================ >> + >> +The ACPI ERST device is utilized to support the ACPI Error Record >> +Serialization Table, ERST, functionality. The functionality is >> +designed for storing error records in persistent storage for >> +future reference/debugging. >> + >> +The ACPI specification[1], in Chapter "ACPI Platform Error Interfaces >> +(APEI)", and specifically subsection "Error Serialization", outlines >> +a method for storing error records into persistent storage. >> + >> +The format of error records is described in the UEFI specification[2], >> +in Appendix N "Common Platform Error Record". >> + >> +While the ACPI specification allows for an NVRAM "mode" (see >> +GET_ERROR_LOG_ADDRESS_RANGE_ATTRIBUTES) where non-volatile RAM is >> +directly exposed for direct access by the OS/guest, this implements >> +the non-NVRAM "mode". This non-NVRAM "mode" is what is implemented >> +by most BIOS (since flash memory requires programming operations >> +in order to update its contents). Furthermore, as of the time of this >> +writing, Linux does not support the non-NVRAM "mode". > > shouldn't it be s/non-NVRAM/NVRAM/ ? Yes, it has been corrected. > >> + >> + >> +Background/Motivation >> +--------------------- >> +Linux uses the persistent storage filesystem, pstore, to record >> +information (eg. dmesg tail) upon panics and shutdowns. Pstore is >> +independent of, and runs before, kdump. In certain scenarios (ie. >> +hosts/guests with root filesystems on NFS/iSCSI where networking >> +software and/or hardware fails), pstore may contain the only >> +information available for post-mortem debugging. > > well, > it's not the only way, one can use existing pvpanic device to notify > mgmt layer about crash and mgmt layer can take appropriate measures > to for post-mortem debugging, including dumping guest state, > which is superior to anything pstore can offer as VM is still exists > and mgmt layer can inspect VMs crashed state directly or dump > necessary parts of it. > > So ERST shouldn't be portrayed as the only way here but rather > as limited alternative to pvpanic in regards to post-mortem debugging > (it's the only way only on bare-metal). > > It would be better to describe here other use-cases you've mentioned > in earlier reviews, that justify adding alternative to pvpanic. I'm not sure how I would change this. I do say "may contain", which means it is not the only way. Pvpanic is a way to notify the mgmt layer/host, but this is a method solely with the guest. Each serves a different purpose; plugs a different hole. As noted in a separate message, my company has intentions of storing other data in ERST beyond panics. > >> +Two common storage backends for the pstore filesystem are ACPI ERST >> +and UEFI. Most BIOS implement ACPI ERST. UEFI is not utilized in >> +all guests. With QEMU supporting ACPI ERST, it becomes a viable >> +pstore storage backend for virtual machines (as it is now for >> +bare metal machines). >> + > >> +Enabling support for ACPI ERST facilitates a consistent method to >> +capture kernel panic information in a wide range of guests: from >> +resource-constrained microvms to very large guests, and in >> +particular, in direct-boot environments (which would lack UEFI >> +run-time services). > this hunk probably not necessary > >> + >> +Note that Microsoft Windows also utilizes the ACPI ERST for certain >> +crash information, if available. > a pointer to a relevant source would be helpful here. I've included the reference, here for your benefit. Windows Hardware Error Architecutre, specifically Persistence Mechanism https://docs.microsoft.com/en-us/windows-hardware/drivers/whea/error-record-persistence-mechanism > >> +Invocation > s/^^/Configuration|Usage/ Corrected > >> +---------- >> + >> +To utilize ACPI ERST, a memory-backend-file object and acpi-erst > s/utilize/use/ Corrected > >> +device must be created, for example: > s/must/can/ Corrected > >> + >> + qemu ... >> + -object memory-backend-file,id=erstnvram,mem-path=acpi-erst.backing, >> + size=0x10000,share=on > I'd put ^^^ on the same line as -object and use '\' at the end the > so example could be easily copy-pasted Corrected > >> + -device acpi-erst,memdev=erstnvram >> + >> +For proper operation, the ACPI ERST device needs a memory-backend-file >> +object with the following parameters: >> + >> + - id: The id of the memory-backend-file object is used to associate >> + this memory with the acpi-erst device. >> + - size: The size of the ACPI ERST backing storage. This parameter is >> + required. >> + - mem-path: The location of the ACPI ERST backing storage file. This >> + parameter is also required. >> + - share: The share=on parameter is required so that updates to the >> + ERST back store are written to the file immediately as well. Without >> + it, updates the the backing file are unpredictable and may not >> + properly persist (eg. if qemu should crash). > > mmap manpage says: > MAP_SHARED > Updates to the mapping ... are carried through to the underlying file. > it doesn't guarantee 'written to the file immediately', though. > So I'd rephrase it to something like that: > > - share: The share=on parameter is required so that updates to the ERST back store > are written back to the file. Corrected > >> + >> +The ACPI ERST device is a simple PCI device, and requires this one >> +parameter: > s/^.*:/and ERST device:/ Corrected > >> + >> + - memdev: Is the object id of the memory-backend-file. >> + >> + >> +PCI Interface >> +------------- >> + >> +The ERST device is a PCI device with two BARs, one for accessing >> +the programming registers, and the other for accessing the >> +record exchange buffer. >> + >> +BAR0 contains the programming interface consisting of just two >> +64-bit registers. The two registers are an ACTION (cmd) and a >> +VALUE (data). All ERST actions/operations/side effects happen > s/consisting of... All ERST/consisting of ACTION and VALUE 64-bit registers. All ERST/ Corrected > >> +on the write to the ACTION, by design. Thus any data needed > s/Thus// Corrected > >> +by the action must be placed into VALUE prior to writing >> +ACTION. Reading the VALUE simply returns the register contents, >> +which can be updated by a previous ACTION. > >> This behavior is >> +encoded in the ACPI ERST table generated by QEMU. > it's too vague, Either drop sentence or add a reference to relevant place in spec. Corrected > > >> + >> +BAR1 contains the record exchange buffer, and the size of this >> +buffer sets the maximum record size. This record exchange >> +buffer size is 8KiB. > s/^^^/ > BAR1 contains the 8KiB record exchange buffer, which is the implemented maximum record size limit. Corrected > > >> +Backing File > > s/^^^/Backing Storage Format/ Corrected > >> +------------ > > >> + >> +The ACPI ERST persistent storage is contained within a single backing >> +file. The size and location of the backing file is specified upon >> +QEMU startup of the ACPI ERST device. > > I'd drop above paragraph and describe file format here, > ultimately used backend doesn't have to be a file. For > example if user doesn't need it persist over QEMU restarts, > ram backend could be used, guest will still be able to see > it's own crash log after guest is reboot, or it could be > memfd backend passed to QEMU by mgmt layer. Dropped > > >> +Records are stored in the backing file in a simple fashion. > s/backing file/backend storage/ > ditto for other occurrences Corrected > >> +The backing file is essentially divided into fixed size >> +"slots", ERST_RECORD_SIZE in length, with each "slot" >> +storing a single record. > >> No attempt at optimizing storage >> +through compression, compaction, etc is attempted. > s/^^^// I'd like to keep this statement. It is there because in a number of hardware BIOS I tested, these kinds of features lead to bugs in the ERST support. > >> +NOTE that any change to this value will make any pre- >> +existing backing files, not of the same ERST_RECORD_SIZE, >> +unusable to the guest. > when that can happen, can we detect it and error out? I've dropped this statement. That value is hard coded, and not a parameter, so there is no simple way to change it. This comment does exist next to the ERST_RECORD_SIZE declaration in the code. > > >> +Below is an example layout of the backing store file. >> +The size of the file is a multiple of ERST_RECORD_SIZE, >> +and contains N number of "slots" to store records. The >> +example below shows two records (in CPER format) in the >> +backing file, while the remaining slots are empty/ >> +available. >> + >> + Slot Record >> + +--------------------------------------------+ >> + 0 | empty/available | >> + +--------------------------------------------+ >> + 1 | CPER | >> + +--------------------------------------------+ >> + 2 | CPER | >> + +--------------------------------------------+ >> + ... | | >> + +--------------------------------------------+ >> + N | empty/available | >> + +--------------------------------------------+ >> + <-------------- ERST_RECORD_SIZE ------------> > > >> +Not all slots need to be occupied, and they need not be >> +occupied in a contiguous fashion. The ability to clear/erase >> +specific records allows for the formation of unoccupied >> +slots. > I'd drop this as not necessary I'd like to keep this statement. Again, several BIOS on which I tested ERST had bugs around non-contiguous record storage. > > >> + >> + >> +References >> +---------- >> + >> +[1] "Advanced Configuration and Power Interface Specification", >> + version 4.0, June 2009. >> + >> +[2] "Unified Extensible Firmware Interface Specification", >> + version 2.1, October 2008. >> + >> -- >> 1.8.3.1 >> >
On Wed, 21 Jul 2021 10:42:33 -0500 Eric DeVolder <eric.devolder@oracle.com> wrote: > On 7/19/21 10:02 AM, Igor Mammedov wrote: > > On Wed, 30 Jun 2021 19:26:39 +0000 > > Eric DeVolder <eric.devolder@oracle.com> wrote: > > > >> Oops, at the end of the 4th paragraph, I meant to state that "Linux does not support the NVRAM mode." > >> rather than "non-NVRAM mode", which contradicts everything I stated prior. > >> Eric. > >> ________________________________ > >> From: Eric DeVolder <eric.devolder@oracle.com> > >> Sent: Wednesday, June 30, 2021 2:07 PM > >> To: qemu-devel@nongnu.org <qemu-devel@nongnu.org> > >> Cc: mst@redhat.com <mst@redhat.com>; imammedo@redhat.com <imammedo@redhat.com>; marcel.apfelbaum@gmail.com <marcel.apfelbaum@gmail.com>; pbonzini@redhat.com <pbonzini@redhat.com>; rth@twiddle.net <rth@twiddle.net>; ehabkost@redhat.com <ehabkost@redhat.com>; Konrad Wilk <konrad.wilk@oracle.com>; Boris Ostrovsky <boris.ostrovsky@oracle.com> > >> Subject: [PATCH v5 02/10] ACPI ERST: specification for ERST support > >> > >> Information on the implementation of the ACPI ERST support. > >> > >> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> > >> --- > >> docs/specs/acpi_erst.txt | 152 +++++++++++++++++++++++++++++++++++++++++++++++ > >> 1 file changed, 152 insertions(+) > >> create mode 100644 docs/specs/acpi_erst.txt > >> > >> diff --git a/docs/specs/acpi_erst.txt b/docs/specs/acpi_erst.txt > >> new file mode 100644 > >> index 0000000..79f8eb9 > >> --- /dev/null > >> +++ b/docs/specs/acpi_erst.txt > >> @@ -0,0 +1,152 @@ > >> +ACPI ERST DEVICE > >> +================ > >> + > >> +The ACPI ERST device is utilized to support the ACPI Error Record > >> +Serialization Table, ERST, functionality. The functionality is > >> +designed for storing error records in persistent storage for > >> +future reference/debugging. > >> + > >> +The ACPI specification[1], in Chapter "ACPI Platform Error Interfaces > >> +(APEI)", and specifically subsection "Error Serialization", outlines > >> +a method for storing error records into persistent storage. > >> + > >> +The format of error records is described in the UEFI specification[2], > >> +in Appendix N "Common Platform Error Record". > >> + > >> +While the ACPI specification allows for an NVRAM "mode" (see > >> +GET_ERROR_LOG_ADDRESS_RANGE_ATTRIBUTES) where non-volatile RAM is > >> +directly exposed for direct access by the OS/guest, this implements > >> +the non-NVRAM "mode". This non-NVRAM "mode" is what is implemented > >> +by most BIOS (since flash memory requires programming operations > >> +in order to update its contents). Furthermore, as of the time of this > >> +writing, Linux does not support the non-NVRAM "mode". > > > > shouldn't it be s/non-NVRAM/NVRAM/ ? > > Yes, it has been corrected. > > > > >> + > >> + > >> +Background/Motivation > >> +--------------------- > >> +Linux uses the persistent storage filesystem, pstore, to record > >> +information (eg. dmesg tail) upon panics and shutdowns. Pstore is > >> +independent of, and runs before, kdump. In certain scenarios (ie. > >> +hosts/guests with root filesystems on NFS/iSCSI where networking > >> +software and/or hardware fails), pstore may contain the only > >> +information available for post-mortem debugging. > > > > well, > > it's not the only way, one can use existing pvpanic device to notify > > mgmt layer about crash and mgmt layer can take appropriate measures > > to for post-mortem debugging, including dumping guest state, > > which is superior to anything pstore can offer as VM is still exists > > and mgmt layer can inspect VMs crashed state directly or dump > > necessary parts of it. > > > > So ERST shouldn't be portrayed as the only way here but rather > > as limited alternative to pvpanic in regards to post-mortem debugging > > (it's the only way only on bare-metal). > > > > It would be better to describe here other use-cases you've mentioned > > in earlier reviews, that justify adding alternative to pvpanic. > > I'm not sure how I would change this. I do say "may contain", which means it > is not the only way. Pvpanic is a way to notify the mgmt layer/host, but > this is a method solely with the guest. Each serves a different purpose; > plugs a different hole. > I'd suggest edit "pstore may contain the only information" as "pstore may contain information" > As noted in a separate message, my company has intentions of storing other > data in ERST beyond panics. perhaps add your use cases here as well. > >> +Two common storage backends for the pstore filesystem are ACPI ERST > >> +and UEFI. Most BIOS implement ACPI ERST. UEFI is not utilized in > >> +all guests. With QEMU supporting ACPI ERST, it becomes a viable > >> +pstore storage backend for virtual machines (as it is now for > >> +bare metal machines). > >> + > > > >> +Enabling support for ACPI ERST facilitates a consistent method to > >> +capture kernel panic information in a wide range of guests: from > >> +resource-constrained microvms to very large guests, and in > >> +particular, in direct-boot environments (which would lack UEFI > >> +run-time services). > > this hunk probably not necessary > > > >> + > >> +Note that Microsoft Windows also utilizes the ACPI ERST for certain > >> +crash information, if available. > > a pointer to a relevant source would be helpful here. > > I've included the reference, here for your benefit. > Windows Hardware Error Architecutre, specifically Persistence Mechanism > https://docs.microsoft.com/en-us/windows-hardware/drivers/whea/error-record-persistence-mechanism > > > > >> +Invocation > > s/^^/Configuration|Usage/ > > Corrected > > > > >> +---------- > >> + > >> +To utilize ACPI ERST, a memory-backend-file object and acpi-erst > > s/utilize/use/ > > Corrected > > > > >> +device must be created, for example: > > s/must/can/ > > Corrected > > > > >> + > >> + qemu ... > >> + -object memory-backend-file,id=erstnvram,mem-path=acpi-erst.backing, > >> + size=0x10000,share=on > > I'd put ^^^ on the same line as -object and use '\' at the end the > > so example could be easily copy-pasted > > Corrected > > > > >> + -device acpi-erst,memdev=erstnvram > >> + > >> +For proper operation, the ACPI ERST device needs a memory-backend-file > >> +object with the following parameters: > >> + > >> + - id: The id of the memory-backend-file object is used to associate > >> + this memory with the acpi-erst device. > >> + - size: The size of the ACPI ERST backing storage. This parameter is > >> + required. > >> + - mem-path: The location of the ACPI ERST backing storage file. This > >> + parameter is also required. > >> + - share: The share=on parameter is required so that updates to the > >> + ERST back store are written to the file immediately as well. Without > >> + it, updates the the backing file are unpredictable and may not > >> + properly persist (eg. if qemu should crash). > > > > mmap manpage says: > > MAP_SHARED > > Updates to the mapping ... are carried through to the underlying file. > > it doesn't guarantee 'written to the file immediately', though. > > So I'd rephrase it to something like that: > > > > - share: The share=on parameter is required so that updates to the ERST back store > > are written back to the file. > > Corrected > > > > >> + > >> +The ACPI ERST device is a simple PCI device, and requires this one > >> +parameter: > > s/^.*:/and ERST device:/ > > Corrected > > > > >> + > >> + - memdev: Is the object id of the memory-backend-file. > >> + > >> + > >> +PCI Interface > >> +------------- > >> + > >> +The ERST device is a PCI device with two BARs, one for accessing > >> +the programming registers, and the other for accessing the > >> +record exchange buffer. > >> + > >> +BAR0 contains the programming interface consisting of just two > >> +64-bit registers. The two registers are an ACTION (cmd) and a > >> +VALUE (data). All ERST actions/operations/side effects happen > > s/consisting of... All ERST/consisting of ACTION and VALUE 64-bit registers. All ERST/ > > Corrected > > > > >> +on the write to the ACTION, by design. Thus any data needed > > s/Thus// > Corrected > > > > >> +by the action must be placed into VALUE prior to writing > >> +ACTION. Reading the VALUE simply returns the register contents, > >> +which can be updated by a previous ACTION. > > > >> This behavior is > >> +encoded in the ACPI ERST table generated by QEMU. > > it's too vague, Either drop sentence or add a reference to relevant place in spec. > Corrected > > > > > > >> + > >> +BAR1 contains the record exchange buffer, and the size of this > >> +buffer sets the maximum record size. This record exchange > >> +buffer size is 8KiB. > > s/^^^/ > > BAR1 contains the 8KiB record exchange buffer, which is the implemented maximum record size limit. > Corrected > > > > > > >> +Backing File > > > > s/^^^/Backing Storage Format/ > Corrected > > > > >> +------------ > > > > > >> + > >> +The ACPI ERST persistent storage is contained within a single backing > >> +file. The size and location of the backing file is specified upon > >> +QEMU startup of the ACPI ERST device. > > > > I'd drop above paragraph and describe file format here, > > ultimately used backend doesn't have to be a file. For > > example if user doesn't need it persist over QEMU restarts, > > ram backend could be used, guest will still be able to see > > it's own crash log after guest is reboot, or it could be > > memfd backend passed to QEMU by mgmt layer. > Dropped > > > > > > >> +Records are stored in the backing file in a simple fashion. > > s/backing file/backend storage/ > > ditto for other occurrences > Corrected > > > > >> +The backing file is essentially divided into fixed size > >> +"slots", ERST_RECORD_SIZE in length, with each "slot" > >> +storing a single record. > > > >> No attempt at optimizing storage > >> +through compression, compaction, etc is attempted. > > s/^^^// > > I'd like to keep this statement. It is there because in a number of > hardware BIOS I tested, these kinds of features lead to bugs in the > ERST support. this doc it's not about issues in other BIOSes, it's about conrete QEMU impl. So sentence starting with "No attempt" is not relevant here at all. > >> +NOTE that any change to this value will make any pre- > >> +existing backing files, not of the same ERST_RECORD_SIZE, > >> +unusable to the guest. > > when that can happen, can we detect it and error out? > I've dropped this statement. That value is hard coded, and not a > parameter, so there is no simple way to change it. This comment > does exist next to the ERST_RECORD_SIZE declaration in the code. It's not a problem with current impl. but rather with possible future expansion. If you'd add a header with record size at the start of storage, it wouldn't be issue as ERST would be able to get used record size for storage. That will help with avoiding compat issues later on. > >> +Below is an example layout of the backing store file. > >> +The size of the file is a multiple of ERST_RECORD_SIZE, > >> +and contains N number of "slots" to store records. The > >> +example below shows two records (in CPER format) in the > >> +backing file, while the remaining slots are empty/ > >> +available. > >> + > >> + Slot Record > >> + +--------------------------------------------+ > >> + 0 | empty/available | > >> + +--------------------------------------------+ > >> + 1 | CPER | > >> + +--------------------------------------------+ > >> + 2 | CPER | > >> + +--------------------------------------------+ > >> + ... | | > >> + +--------------------------------------------+ > >> + N | empty/available | > >> + +--------------------------------------------+ > >> + <-------------- ERST_RECORD_SIZE ------------> > > > > > >> +Not all slots need to be occupied, and they need not be > >> +occupied in a contiguous fashion. The ability to clear/erase > >> +specific records allows for the formation of unoccupied > >> +slots. > > I'd drop this as not necessary > > I'd like to keep this statement. Again, several BIOS on which I tested > ERST had bugs around non-contiguous record storage. I'd drop this and alter sentence above ending with " in a simple fashion." to describe sparse usage of storage and then after that comes example diagram. I'd like this part to start with unambiguous concise description of format and to be finished with example diagram. It is the part that will be considered as the the only true source how file should be formatted, when it comes to fixing bugs or modifying original impl. later on. So it's important to have clear description without any unnecessary information here. > > > > > >> + > >> + > >> +References > >> +---------- > >> + > >> +[1] "Advanced Configuration and Power Interface Specification", > >> + version 4.0, June 2009. > >> + > >> +[2] "Unified Extensible Firmware Interface Specification", > >> + version 2.1, October 2008. > >> + > >> -- > >> 1.8.3.1 > >> > > >
On 7/26/21 5:06 AM, Igor Mammedov wrote: > On Wed, 21 Jul 2021 10:42:33 -0500 > Eric DeVolder <eric.devolder@oracle.com> wrote: > >> On 7/19/21 10:02 AM, Igor Mammedov wrote: >>> On Wed, 30 Jun 2021 19:26:39 +0000 >>> Eric DeVolder <eric.devolder@oracle.com> wrote: >>> >>>> Oops, at the end of the 4th paragraph, I meant to state that "Linux does not support the NVRAM mode." >>>> rather than "non-NVRAM mode", which contradicts everything I stated prior. >>>> Eric. >>>> ________________________________ >>>> From: Eric DeVolder <eric.devolder@oracle.com> >>>> Sent: Wednesday, June 30, 2021 2:07 PM >>>> To: qemu-devel@nongnu.org <qemu-devel@nongnu.org> >>>> Cc: mst@redhat.com <mst@redhat.com>; imammedo@redhat.com <imammedo@redhat.com>; marcel.apfelbaum@gmail.com <marcel.apfelbaum@gmail.com>; pbonzini@redhat.com <pbonzini@redhat.com>; rth@twiddle.net <rth@twiddle.net>; ehabkost@redhat.com <ehabkost@redhat.com>; Konrad Wilk <konrad.wilk@oracle.com>; Boris Ostrovsky <boris.ostrovsky@oracle.com> >>>> Subject: [PATCH v5 02/10] ACPI ERST: specification for ERST support >>>> >>>> Information on the implementation of the ACPI ERST support. >>>> >>>> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> >>>> --- >>>> docs/specs/acpi_erst.txt | 152 +++++++++++++++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 152 insertions(+) >>>> create mode 100644 docs/specs/acpi_erst.txt >>>> >>>> diff --git a/docs/specs/acpi_erst.txt b/docs/specs/acpi_erst.txt >>>> new file mode 100644 >>>> index 0000000..79f8eb9 >>>> --- /dev/null >>>> +++ b/docs/specs/acpi_erst.txt >>>> @@ -0,0 +1,152 @@ >>>> +ACPI ERST DEVICE >>>> +================ >>>> + >>>> +The ACPI ERST device is utilized to support the ACPI Error Record >>>> +Serialization Table, ERST, functionality. The functionality is >>>> +designed for storing error records in persistent storage for >>>> +future reference/debugging. >>>> + >>>> +The ACPI specification[1], in Chapter "ACPI Platform Error Interfaces >>>> +(APEI)", and specifically subsection "Error Serialization", outlines >>>> +a method for storing error records into persistent storage. >>>> + >>>> +The format of error records is described in the UEFI specification[2], >>>> +in Appendix N "Common Platform Error Record". >>>> + >>>> +While the ACPI specification allows for an NVRAM "mode" (see >>>> +GET_ERROR_LOG_ADDRESS_RANGE_ATTRIBUTES) where non-volatile RAM is >>>> +directly exposed for direct access by the OS/guest, this implements >>>> +the non-NVRAM "mode". This non-NVRAM "mode" is what is implemented >>>> +by most BIOS (since flash memory requires programming operations >>>> +in order to update its contents). Furthermore, as of the time of this >>>> +writing, Linux does not support the non-NVRAM "mode". >>> >>> shouldn't it be s/non-NVRAM/NVRAM/ ? >> >> Yes, it has been corrected. >> >>> >>>> + >>>> + >>>> +Background/Motivation >>>> +--------------------- >>>> +Linux uses the persistent storage filesystem, pstore, to record >>>> +information (eg. dmesg tail) upon panics and shutdowns. Pstore is >>>> +independent of, and runs before, kdump. In certain scenarios (ie. >>>> +hosts/guests with root filesystems on NFS/iSCSI where networking >>>> +software and/or hardware fails), pstore may contain the only >>>> +information available for post-mortem debugging. >>> >>> well, >>> it's not the only way, one can use existing pvpanic device to notify >>> mgmt layer about crash and mgmt layer can take appropriate measures >>> to for post-mortem debugging, including dumping guest state, >>> which is superior to anything pstore can offer as VM is still exists >>> and mgmt layer can inspect VMs crashed state directly or dump >>> necessary parts of it. >>> >>> So ERST shouldn't be portrayed as the only way here but rather >>> as limited alternative to pvpanic in regards to post-mortem debugging >>> (it's the only way only on bare-metal). >>> >>> It would be better to describe here other use-cases you've mentioned >>> in earlier reviews, that justify adding alternative to pvpanic. >> >> I'm not sure how I would change this. I do say "may contain", which means it >> is not the only way. Pvpanic is a way to notify the mgmt layer/host, but >> this is a method solely with the guest. Each serves a different purpose; >> plugs a different hole. >> > > I'd suggest edit "pstore may contain the only information" as "pstore may contain information" > Done >> As noted in a separate message, my company has intentions of storing other >> data in ERST beyond panics. > perhaps add your use cases here as well. > > >>>> +Two common storage backends for the pstore filesystem are ACPI ERST >>>> +and UEFI. Most BIOS implement ACPI ERST. UEFI is not utilized in >>>> +all guests. With QEMU supporting ACPI ERST, it becomes a viable >>>> +pstore storage backend for virtual machines (as it is now for >>>> +bare metal machines). >>>> + >>> >>>> +Enabling support for ACPI ERST facilitates a consistent method to >>>> +capture kernel panic information in a wide range of guests: from >>>> +resource-constrained microvms to very large guests, and in >>>> +particular, in direct-boot environments (which would lack UEFI >>>> +run-time services). >>> this hunk probably not necessary >>> >>>> + >>>> +Note that Microsoft Windows also utilizes the ACPI ERST for certain >>>> +crash information, if available. >>> a pointer to a relevant source would be helpful here. >> >> I've included the reference, here for your benefit. >> Windows Hardware Error Architecutre, specifically Persistence Mechanism >> https://docs.microsoft.com/en-us/windows-hardware/drivers/whea/error-record-persistence-mechanism >> >>> >>>> +Invocation >>> s/^^/Configuration|Usage/ >> >> Corrected >> >>> >>>> +---------- >>>> + >>>> +To utilize ACPI ERST, a memory-backend-file object and acpi-erst >>> s/utilize/use/ >> >> Corrected >> >>> >>>> +device must be created, for example: >>> s/must/can/ >> >> Corrected >> >>> >>>> + >>>> + qemu ... >>>> + -object memory-backend-file,id=erstnvram,mem-path=acpi-erst.backing, >>>> + size=0x10000,share=on >>> I'd put ^^^ on the same line as -object and use '\' at the end the >>> so example could be easily copy-pasted >> >> Corrected >> >>> >>>> + -device acpi-erst,memdev=erstnvram >>>> + >>>> +For proper operation, the ACPI ERST device needs a memory-backend-file >>>> +object with the following parameters: >>>> + >>>> + - id: The id of the memory-backend-file object is used to associate >>>> + this memory with the acpi-erst device. >>>> + - size: The size of the ACPI ERST backing storage. This parameter is >>>> + required. >>>> + - mem-path: The location of the ACPI ERST backing storage file. This >>>> + parameter is also required. >>>> + - share: The share=on parameter is required so that updates to the >>>> + ERST back store are written to the file immediately as well. Without >>>> + it, updates the the backing file are unpredictable and may not >>>> + properly persist (eg. if qemu should crash). >>> >>> mmap manpage says: >>> MAP_SHARED >>> Updates to the mapping ... are carried through to the underlying file. >>> it doesn't guarantee 'written to the file immediately', though. >>> So I'd rephrase it to something like that: >>> >>> - share: The share=on parameter is required so that updates to the ERST back store >>> are written back to the file. >> >> Corrected >> >>> >>>> + >>>> +The ACPI ERST device is a simple PCI device, and requires this one >>>> +parameter: >>> s/^.*:/and ERST device:/ >> >> Corrected >> >>> >>>> + >>>> + - memdev: Is the object id of the memory-backend-file. >>>> + >>>> + >>>> +PCI Interface >>>> +------------- >>>> + >>>> +The ERST device is a PCI device with two BARs, one for accessing >>>> +the programming registers, and the other for accessing the >>>> +record exchange buffer. >>>> + >>>> +BAR0 contains the programming interface consisting of just two >>>> +64-bit registers. The two registers are an ACTION (cmd) and a >>>> +VALUE (data). All ERST actions/operations/side effects happen >>> s/consisting of... All ERST/consisting of ACTION and VALUE 64-bit registers. All ERST/ >> >> Corrected >> >>> >>>> +on the write to the ACTION, by design. Thus any data needed >>> s/Thus// >> Corrected >> >>> >>>> +by the action must be placed into VALUE prior to writing >>>> +ACTION. Reading the VALUE simply returns the register contents, >>>> +which can be updated by a previous ACTION. >>> >>>> This behavior is >>>> +encoded in the ACPI ERST table generated by QEMU. >>> it's too vague, Either drop sentence or add a reference to relevant place in spec. >> Corrected >> >>> >>> >>>> + >>>> +BAR1 contains the record exchange buffer, and the size of this >>>> +buffer sets the maximum record size. This record exchange >>>> +buffer size is 8KiB. >>> s/^^^/ >>> BAR1 contains the 8KiB record exchange buffer, which is the implemented maximum record size limit. >> Corrected >> >>> >>> >>>> +Backing File >>> >>> s/^^^/Backing Storage Format/ >> Corrected >> >>> >>>> +------------ >>> >>> >>>> + >>>> +The ACPI ERST persistent storage is contained within a single backing >>>> +file. The size and location of the backing file is specified upon >>>> +QEMU startup of the ACPI ERST device. >>> >>> I'd drop above paragraph and describe file format here, >>> ultimately used backend doesn't have to be a file. For >>> example if user doesn't need it persist over QEMU restarts, >>> ram backend could be used, guest will still be able to see >>> it's own crash log after guest is reboot, or it could be >>> memfd backend passed to QEMU by mgmt layer. >> Dropped >> >>> >>> >>>> +Records are stored in the backing file in a simple fashion. >>> s/backing file/backend storage/ >>> ditto for other occurrences >> Corrected >> >>> >>>> +The backing file is essentially divided into fixed size >>>> +"slots", ERST_RECORD_SIZE in length, with each "slot" >>>> +storing a single record. >>> >>>> No attempt at optimizing storage >>>> +through compression, compaction, etc is attempted. >>> s/^^^// >> >> I'd like to keep this statement. It is there because in a number of >> hardware BIOS I tested, these kinds of features lead to bugs in the >> ERST support. > this doc it's not about issues in other BIOSes, it's about conrete > QEMU impl. So sentence starting with "No attempt" is not relevant here at all. Dropped > >>>> +NOTE that any change to this value will make any pre- >>>> +existing backing files, not of the same ERST_RECORD_SIZE, >>>> +unusable to the guest. >>> when that can happen, can we detect it and error out? >> I've dropped this statement. That value is hard coded, and not a >> parameter, so there is no simple way to change it. This comment >> does exist next to the ERST_RECORD_SIZE declaration in the code. > > It's not a problem with current impl. but rather with possible > future expansion. > > If you'd add a header with record size at the start of storage, > it wouldn't be issue as ERST would be able to get used record > size for storage. That will help with avoiding compat issues > later on. I'll go ahead and add the header. I'll put the magic and record size in there, but I do not intend to put any data that would be "cached" from the records themselves. So no recordids, in particular, will be cached in this header. I'm not even sure I want to record/cache the number of records because: - it has almost no use (undermined by the fact overall storage size is not exposed, imho) - we backed off requiring the share=on, so it is conceivable this header value could encounter data integrity issues, should a guest crash... - scans still happen (see next) While having it (number of records cached in header) would avoid a startup scan to compute it, the write operation requires a scan to determine if the incoming recordid is present or not, in order to determine overwrite or allocate-and-write. And typically the first operation that Linux does is effectively a scan to populate the /sys/fs/pstore entries via the GET_RECORD_IDENTIFIER action. And the typical size of the ERST storage [on hardware systems] is 64 to 128KiB; so not much storage to examine, especially since only looking at 12 bytes of each 8KiB record. I'd still be in favor of putting an upper bound on the hostmem object, to avoid your worst case fears... > >>>> +Below is an example layout of the backing store file. >>>> +The size of the file is a multiple of ERST_RECORD_SIZE, >>>> +and contains N number of "slots" to store records. The >>>> +example below shows two records (in CPER format) in the >>>> +backing file, while the remaining slots are empty/ >>>> +available. >>>> + >>>> + Slot Record >>>> + +--------------------------------------------+ >>>> + 0 | empty/available | >>>> + +--------------------------------------------+ >>>> + 1 | CPER | >>>> + +--------------------------------------------+ >>>> + 2 | CPER | >>>> + +--------------------------------------------+ >>>> + ... | | >>>> + +--------------------------------------------+ >>>> + N | empty/available | >>>> + +--------------------------------------------+ >>>> + <-------------- ERST_RECORD_SIZE ------------> >>> >>> >>>> +Not all slots need to be occupied, and they need not be >>>> +occupied in a contiguous fashion. The ability to clear/erase >>>> +specific records allows for the formation of unoccupied >>>> +slots. >>> I'd drop this as not necessary >> >> I'd like to keep this statement. Again, several BIOS on which I tested >> ERST had bugs around non-contiguous record storage. > > I'd drop this and alter sentence above ending with " in a simple fashion." > to describe sparse usage of storage and then after that comes example diagram. Done > > I'd like this part to start with unambiguous concise description of > format and to be finished with example diagram. > It is the part that will be considered as the the only true source > how file should be formatted, when it comes to fixing bugs or > modifying original impl. later on. So it's important to have clear > description without any unnecessary information here. Done > > >>> >>> >>>> + >>>> + >>>> +References >>>> +---------- >>>> + >>>> +[1] "Advanced Configuration and Power Interface Specification", >>>> + version 4.0, June 2009. >>>> + >>>> +[2] "Unified Extensible Firmware Interface Specification", >>>> + version 2.1, October 2008. >>>> + >>>> -- >>>> 1.8.3.1 >>>> >>> >> >
On Mon, 26 Jul 2021 14:52:15 -0500 Eric DeVolder <eric.devolder@oracle.com> wrote: > On 7/26/21 5:06 AM, Igor Mammedov wrote: > > On Wed, 21 Jul 2021 10:42:33 -0500 > > Eric DeVolder <eric.devolder@oracle.com> wrote: > > > >> On 7/19/21 10:02 AM, Igor Mammedov wrote: > >>> On Wed, 30 Jun 2021 19:26:39 +0000 > >>> Eric DeVolder <eric.devolder@oracle.com> wrote: > >>> > >>>> Oops, at the end of the 4th paragraph, I meant to state that "Linux does not support the NVRAM mode." > >>>> rather than "non-NVRAM mode", which contradicts everything I stated prior. > >>>> Eric. > >>>> ________________________________ > >>>> From: Eric DeVolder <eric.devolder@oracle.com> > >>>> Sent: Wednesday, June 30, 2021 2:07 PM > >>>> To: qemu-devel@nongnu.org <qemu-devel@nongnu.org> > >>>> Cc: mst@redhat.com <mst@redhat.com>; imammedo@redhat.com <imammedo@redhat.com>; marcel.apfelbaum@gmail.com <marcel.apfelbaum@gmail.com>; pbonzini@redhat.com <pbonzini@redhat.com>; rth@twiddle.net <rth@twiddle.net>; ehabkost@redhat.com <ehabkost@redhat.com>; Konrad Wilk <konrad.wilk@oracle.com>; Boris Ostrovsky <boris.ostrovsky@oracle.com> > >>>> Subject: [PATCH v5 02/10] ACPI ERST: specification for ERST support > >>>> > >>>> Information on the implementation of the ACPI ERST support. > >>>> > >>>> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> > >>>> --- > >>>> docs/specs/acpi_erst.txt | 152 +++++++++++++++++++++++++++++++++++++++++++++++ > >>>> 1 file changed, 152 insertions(+) > >>>> create mode 100644 docs/specs/acpi_erst.txt > >>>> > >>>> diff --git a/docs/specs/acpi_erst.txt b/docs/specs/acpi_erst.txt > >>>> new file mode 100644 > >>>> index 0000000..79f8eb9 > >>>> --- /dev/null > >>>> +++ b/docs/specs/acpi_erst.txt > >>>> @@ -0,0 +1,152 @@ > >>>> +ACPI ERST DEVICE > >>>> +================ > >>>> + > >>>> +The ACPI ERST device is utilized to support the ACPI Error Record > >>>> +Serialization Table, ERST, functionality. The functionality is > >>>> +designed for storing error records in persistent storage for > >>>> +future reference/debugging. > >>>> + > >>>> +The ACPI specification[1], in Chapter "ACPI Platform Error Interfaces > >>>> +(APEI)", and specifically subsection "Error Serialization", outlines > >>>> +a method for storing error records into persistent storage. > >>>> + > >>>> +The format of error records is described in the UEFI specification[2], > >>>> +in Appendix N "Common Platform Error Record". > >>>> + > >>>> +While the ACPI specification allows for an NVRAM "mode" (see > >>>> +GET_ERROR_LOG_ADDRESS_RANGE_ATTRIBUTES) where non-volatile RAM is > >>>> +directly exposed for direct access by the OS/guest, this implements > >>>> +the non-NVRAM "mode". This non-NVRAM "mode" is what is implemented > >>>> +by most BIOS (since flash memory requires programming operations > >>>> +in order to update its contents). Furthermore, as of the time of this > >>>> +writing, Linux does not support the non-NVRAM "mode". > >>> > >>> shouldn't it be s/non-NVRAM/NVRAM/ ? > >> > >> Yes, it has been corrected. > >> > >>> > >>>> + > >>>> + > >>>> +Background/Motivation > >>>> +--------------------- > >>>> +Linux uses the persistent storage filesystem, pstore, to record > >>>> +information (eg. dmesg tail) upon panics and shutdowns. Pstore is > >>>> +independent of, and runs before, kdump. In certain scenarios (ie. > >>>> +hosts/guests with root filesystems on NFS/iSCSI where networking > >>>> +software and/or hardware fails), pstore may contain the only > >>>> +information available for post-mortem debugging. > >>> > >>> well, > >>> it's not the only way, one can use existing pvpanic device to notify > >>> mgmt layer about crash and mgmt layer can take appropriate measures > >>> to for post-mortem debugging, including dumping guest state, > >>> which is superior to anything pstore can offer as VM is still exists > >>> and mgmt layer can inspect VMs crashed state directly or dump > >>> necessary parts of it. > >>> > >>> So ERST shouldn't be portrayed as the only way here but rather > >>> as limited alternative to pvpanic in regards to post-mortem debugging > >>> (it's the only way only on bare-metal). > >>> > >>> It would be better to describe here other use-cases you've mentioned > >>> in earlier reviews, that justify adding alternative to pvpanic. > >> > >> I'm not sure how I would change this. I do say "may contain", which means it > >> is not the only way. Pvpanic is a way to notify the mgmt layer/host, but > >> this is a method solely with the guest. Each serves a different purpose; > >> plugs a different hole. > >> > > > > I'd suggest edit "pstore may contain the only information" as "pstore may contain information" > > > Done > > >> As noted in a separate message, my company has intentions of storing other > >> data in ERST beyond panics. > > perhaps add your use cases here as well. > > > > > >>>> +Two common storage backends for the pstore filesystem are ACPI ERST > >>>> +and UEFI. Most BIOS implement ACPI ERST. UEFI is not utilized in > >>>> +all guests. With QEMU supporting ACPI ERST, it becomes a viable > >>>> +pstore storage backend for virtual machines (as it is now for > >>>> +bare metal machines). > >>>> + > >>> > >>>> +Enabling support for ACPI ERST facilitates a consistent method to > >>>> +capture kernel panic information in a wide range of guests: from > >>>> +resource-constrained microvms to very large guests, and in > >>>> +particular, in direct-boot environments (which would lack UEFI > >>>> +run-time services). > >>> this hunk probably not necessary > >>> > >>>> + > >>>> +Note that Microsoft Windows also utilizes the ACPI ERST for certain > >>>> +crash information, if available. > >>> a pointer to a relevant source would be helpful here. > >> > >> I've included the reference, here for your benefit. > >> Windows Hardware Error Architecutre, specifically Persistence Mechanism > >> https://docs.microsoft.com/en-us/windows-hardware/drivers/whea/error-record-persistence-mechanism > >> > >>> > >>>> +Invocation > >>> s/^^/Configuration|Usage/ > >> > >> Corrected > >> > >>> > >>>> +---------- > >>>> + > >>>> +To utilize ACPI ERST, a memory-backend-file object and acpi-erst > >>> s/utilize/use/ > >> > >> Corrected > >> > >>> > >>>> +device must be created, for example: > >>> s/must/can/ > >> > >> Corrected > >> > >>> > >>>> + > >>>> + qemu ... > >>>> + -object memory-backend-file,id=erstnvram,mem-path=acpi-erst.backing, > >>>> + size=0x10000,share=on > >>> I'd put ^^^ on the same line as -object and use '\' at the end the > >>> so example could be easily copy-pasted > >> > >> Corrected > >> > >>> > >>>> + -device acpi-erst,memdev=erstnvram > >>>> + > >>>> +For proper operation, the ACPI ERST device needs a memory-backend-file > >>>> +object with the following parameters: > >>>> + > >>>> + - id: The id of the memory-backend-file object is used to associate > >>>> + this memory with the acpi-erst device. > >>>> + - size: The size of the ACPI ERST backing storage. This parameter is > >>>> + required. > >>>> + - mem-path: The location of the ACPI ERST backing storage file. This > >>>> + parameter is also required. > >>>> + - share: The share=on parameter is required so that updates to the > >>>> + ERST back store are written to the file immediately as well. Without > >>>> + it, updates the the backing file are unpredictable and may not > >>>> + properly persist (eg. if qemu should crash). > >>> > >>> mmap manpage says: > >>> MAP_SHARED > >>> Updates to the mapping ... are carried through to the underlying file. > >>> it doesn't guarantee 'written to the file immediately', though. > >>> So I'd rephrase it to something like that: > >>> > >>> - share: The share=on parameter is required so that updates to the ERST back store > >>> are written back to the file. > >> > >> Corrected > >> > >>> > >>>> + > >>>> +The ACPI ERST device is a simple PCI device, and requires this one > >>>> +parameter: > >>> s/^.*:/and ERST device:/ > >> > >> Corrected > >> > >>> > >>>> + > >>>> + - memdev: Is the object id of the memory-backend-file. > >>>> + > >>>> + > >>>> +PCI Interface > >>>> +------------- > >>>> + > >>>> +The ERST device is a PCI device with two BARs, one for accessing > >>>> +the programming registers, and the other for accessing the > >>>> +record exchange buffer. > >>>> + > >>>> +BAR0 contains the programming interface consisting of just two > >>>> +64-bit registers. The two registers are an ACTION (cmd) and a > >>>> +VALUE (data). All ERST actions/operations/side effects happen > >>> s/consisting of... All ERST/consisting of ACTION and VALUE 64-bit registers. All ERST/ > >> > >> Corrected > >> > >>> > >>>> +on the write to the ACTION, by design. Thus any data needed > >>> s/Thus// > >> Corrected > >> > >>> > >>>> +by the action must be placed into VALUE prior to writing > >>>> +ACTION. Reading the VALUE simply returns the register contents, > >>>> +which can be updated by a previous ACTION. > >>> > >>>> This behavior is > >>>> +encoded in the ACPI ERST table generated by QEMU. > >>> it's too vague, Either drop sentence or add a reference to relevant place in spec. > >> Corrected > >> > >>> > >>> > >>>> + > >>>> +BAR1 contains the record exchange buffer, and the size of this > >>>> +buffer sets the maximum record size. This record exchange > >>>> +buffer size is 8KiB. > >>> s/^^^/ > >>> BAR1 contains the 8KiB record exchange buffer, which is the implemented maximum record size limit. > >> Corrected > >> > >>> > >>> > >>>> +Backing File > >>> > >>> s/^^^/Backing Storage Format/ > >> Corrected > >> > >>> > >>>> +------------ > >>> > >>> > >>>> + > >>>> +The ACPI ERST persistent storage is contained within a single backing > >>>> +file. The size and location of the backing file is specified upon > >>>> +QEMU startup of the ACPI ERST device. > >>> > >>> I'd drop above paragraph and describe file format here, > >>> ultimately used backend doesn't have to be a file. For > >>> example if user doesn't need it persist over QEMU restarts, > >>> ram backend could be used, guest will still be able to see > >>> it's own crash log after guest is reboot, or it could be > >>> memfd backend passed to QEMU by mgmt layer. > >> Dropped > >> > >>> > >>> > >>>> +Records are stored in the backing file in a simple fashion. > >>> s/backing file/backend storage/ > >>> ditto for other occurrences > >> Corrected > >> > >>> > >>>> +The backing file is essentially divided into fixed size > >>>> +"slots", ERST_RECORD_SIZE in length, with each "slot" > >>>> +storing a single record. > >>> > >>>> No attempt at optimizing storage > >>>> +through compression, compaction, etc is attempted. > >>> s/^^^// > >> > >> I'd like to keep this statement. It is there because in a number of > >> hardware BIOS I tested, these kinds of features lead to bugs in the > >> ERST support. > > this doc it's not about issues in other BIOSes, it's about conrete > > QEMU impl. So sentence starting with "No attempt" is not relevant here at all. > Dropped > > > > >>>> +NOTE that any change to this value will make any pre- > >>>> +existing backing files, not of the same ERST_RECORD_SIZE, > >>>> +unusable to the guest. > >>> when that can happen, can we detect it and error out? > >> I've dropped this statement. That value is hard coded, and not a > >> parameter, so there is no simple way to change it. This comment > >> does exist next to the ERST_RECORD_SIZE declaration in the code. > > > > It's not a problem with current impl. but rather with possible > > future expansion. > > > > If you'd add a header with record size at the start of storage, > > it wouldn't be issue as ERST would be able to get used record > > size for storage. That will help with avoiding compat issues > > later on. > I'll go ahead and add the header. I'll put the magic and record size in there, > but I do not intend to put any data that would be "cached" from the records > themselves. So no recordids, in particular, will be cached in this header. maybe also add offset of the 1st slot, so however comes later to fix performance issues will have less compatibility issues. > > I'm not even sure I want to record/cache the number of records because: > - it has almost no use (undermined by the fact overall storage size is not exposed, imho) > - we backed off requiring the share=on, so it is conceivable this header value could > encounter data integrity issues, should a guest crash... guest crash won't affect data, and if backend is not shared then, data won't be persistently stored anyways, they will live only for lifetime of QEMU instance. The only time where integrity is affected is host crash and we already agreed that we don't care about this case. > - scans still happen (see next) > > While having it (number of records cached in header) would avoid a startup scan > to compute it, the write operation requires a scan to determine if the incoming > recordid is present or not, in order to determine overwrite or allocate-and-write. if present/non present per slot status is cached, you don't have to do expensive full scan when guest scans slots. > And typically the first operation that Linux does is effectively a scan to > populate the /sys/fs/pstore entries via the GET_RECORD_IDENTIFIER action. > > And the typical size of the ERST storage [on hardware systems] is 64 to 128KiB; > so not much storage to examine, especially since only looking at 12 bytes of each > 8KiB record. > > I'd still be in favor of putting an upper bound on the hostmem object, to avoid > your worst case fears... Considering device is not present by default, I give up on trying to convince you to design it efficiently. If one would wish to use this with container like workloads where fast startup matters, one would have to live with crappy performance or rewrite your impl. > >>>> +Below is an example layout of the backing store file. > >>>> +The size of the file is a multiple of ERST_RECORD_SIZE, > >>>> +and contains N number of "slots" to store records. The > >>>> +example below shows two records (in CPER format) in the > >>>> +backing file, while the remaining slots are empty/ > >>>> +available. > >>>> + > >>>> + Slot Record > >>>> + +--------------------------------------------+ > >>>> + 0 | empty/available | > >>>> + +--------------------------------------------+ > >>>> + 1 | CPER | > >>>> + +--------------------------------------------+ > >>>> + 2 | CPER | > >>>> + +--------------------------------------------+ > >>>> + ... | | > >>>> + +--------------------------------------------+ > >>>> + N | empty/available | > >>>> + +--------------------------------------------+ > >>>> + <-------------- ERST_RECORD_SIZE ------------> > >>> > >>> > >>>> +Not all slots need to be occupied, and they need not be > >>>> +occupied in a contiguous fashion. The ability to clear/erase > >>>> +specific records allows for the formation of unoccupied > >>>> +slots. > >>> I'd drop this as not necessary > >> > >> I'd like to keep this statement. Again, several BIOS on which I tested > >> ERST had bugs around non-contiguous record storage. > > > > I'd drop this and alter sentence above ending with " in a simple fashion." > > to describe sparse usage of storage and then after that comes example diagram. > Done > > > > > I'd like this part to start with unambiguous concise description of > > format and to be finished with example diagram. > > It is the part that will be considered as the the only true source > > how file should be formatted, when it comes to fixing bugs or > > modifying original impl. later on. So it's important to have clear > > description without any unnecessary information here. > Done > > > > > > >>> > >>> > >>>> + > >>>> + > >>>> +References > >>>> +---------- > >>>> + > >>>> +[1] "Advanced Configuration and Power Interface Specification", > >>>> + version 4.0, June 2009. > >>>> + > >>>> +[2] "Unified Extensible Firmware Interface Specification", > >>>> + version 2.1, October 2008. > >>>> + > >>>> -- > >>>> 1.8.3.1 > >>>> > >>> > >> > > >
On 7/27/21 6:45 AM, Igor Mammedov wrote: > On Mon, 26 Jul 2021 14:52:15 -0500 > Eric DeVolder <eric.devolder@oracle.com> wrote: > >> On 7/26/21 5:06 AM, Igor Mammedov wrote: >>> On Wed, 21 Jul 2021 10:42:33 -0500 >>> Eric DeVolder <eric.devolder@oracle.com> wrote: >>> >>>> On 7/19/21 10:02 AM, Igor Mammedov wrote: >>>>> On Wed, 30 Jun 2021 19:26:39 +0000 >>>>> Eric DeVolder <eric.devolder@oracle.com> wrote: >>>>> >>>>>> Oops, at the end of the 4th paragraph, I meant to state that "Linux does not support the NVRAM mode." >>>>>> rather than "non-NVRAM mode", which contradicts everything I stated prior. >>>>>> Eric. >>>>>> ________________________________ >>>>>> From: Eric DeVolder <eric.devolder@oracle.com> >>>>>> Sent: Wednesday, June 30, 2021 2:07 PM >>>>>> To: qemu-devel@nongnu.org <qemu-devel@nongnu.org> >>>>>> Cc: mst@redhat.com <mst@redhat.com>; imammedo@redhat.com <imammedo@redhat.com>; marcel.apfelbaum@gmail.com <marcel.apfelbaum@gmail.com>; pbonzini@redhat.com <pbonzini@redhat.com>; rth@twiddle.net <rth@twiddle.net>; ehabkost@redhat.com <ehabkost@redhat.com>; Konrad Wilk <konrad.wilk@oracle.com>; Boris Ostrovsky <boris.ostrovsky@oracle.com> >>>>>> Subject: [PATCH v5 02/10] ACPI ERST: specification for ERST support >>>>>> >>>>>> Information on the implementation of the ACPI ERST support. >>>>>> >>>>>> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> >>>>>> --- >>>>>> docs/specs/acpi_erst.txt | 152 +++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> 1 file changed, 152 insertions(+) >>>>>> create mode 100644 docs/specs/acpi_erst.txt >>>>>> >>>>>> diff --git a/docs/specs/acpi_erst.txt b/docs/specs/acpi_erst.txt >>>>>> new file mode 100644 >>>>>> index 0000000..79f8eb9 >>>>>> --- /dev/null >>>>>> +++ b/docs/specs/acpi_erst.txt >>>>>> @@ -0,0 +1,152 @@ >>>>>> +ACPI ERST DEVICE >>>>>> +================ >>>>>> + >>>>>> +The ACPI ERST device is utilized to support the ACPI Error Record >>>>>> +Serialization Table, ERST, functionality. The functionality is >>>>>> +designed for storing error records in persistent storage for >>>>>> +future reference/debugging. >>>>>> + >>>>>> +The ACPI specification[1], in Chapter "ACPI Platform Error Interfaces >>>>>> +(APEI)", and specifically subsection "Error Serialization", outlines >>>>>> +a method for storing error records into persistent storage. >>>>>> + >>>>>> +The format of error records is described in the UEFI specification[2], >>>>>> +in Appendix N "Common Platform Error Record". >>>>>> + >>>>>> +While the ACPI specification allows for an NVRAM "mode" (see >>>>>> +GET_ERROR_LOG_ADDRESS_RANGE_ATTRIBUTES) where non-volatile RAM is >>>>>> +directly exposed for direct access by the OS/guest, this implements >>>>>> +the non-NVRAM "mode". This non-NVRAM "mode" is what is implemented >>>>>> +by most BIOS (since flash memory requires programming operations >>>>>> +in order to update its contents). Furthermore, as of the time of this >>>>>> +writing, Linux does not support the non-NVRAM "mode". >>>>> >>>>> shouldn't it be s/non-NVRAM/NVRAM/ ? >>>> >>>> Yes, it has been corrected. >>>> >>>>> >>>>>> + >>>>>> + >>>>>> +Background/Motivation >>>>>> +--------------------- >>>>>> +Linux uses the persistent storage filesystem, pstore, to record >>>>>> +information (eg. dmesg tail) upon panics and shutdowns. Pstore is >>>>>> +independent of, and runs before, kdump. In certain scenarios (ie. >>>>>> +hosts/guests with root filesystems on NFS/iSCSI where networking >>>>>> +software and/or hardware fails), pstore may contain the only >>>>>> +information available for post-mortem debugging. >>>>> >>>>> well, >>>>> it's not the only way, one can use existing pvpanic device to notify >>>>> mgmt layer about crash and mgmt layer can take appropriate measures >>>>> to for post-mortem debugging, including dumping guest state, >>>>> which is superior to anything pstore can offer as VM is still exists >>>>> and mgmt layer can inspect VMs crashed state directly or dump >>>>> necessary parts of it. >>>>> >>>>> So ERST shouldn't be portrayed as the only way here but rather >>>>> as limited alternative to pvpanic in regards to post-mortem debugging >>>>> (it's the only way only on bare-metal). >>>>> >>>>> It would be better to describe here other use-cases you've mentioned >>>>> in earlier reviews, that justify adding alternative to pvpanic. >>>> >>>> I'm not sure how I would change this. I do say "may contain", which means it >>>> is not the only way. Pvpanic is a way to notify the mgmt layer/host, but >>>> this is a method solely with the guest. Each serves a different purpose; >>>> plugs a different hole. >>>> >>> >>> I'd suggest edit "pstore may contain the only information" as "pstore may contain information" >>> >> Done >> >>>> As noted in a separate message, my company has intentions of storing other >>>> data in ERST beyond panics. >>> perhaps add your use cases here as well. >>> >>> >>>>>> +Two common storage backends for the pstore filesystem are ACPI ERST >>>>>> +and UEFI. Most BIOS implement ACPI ERST. UEFI is not utilized in >>>>>> +all guests. With QEMU supporting ACPI ERST, it becomes a viable >>>>>> +pstore storage backend for virtual machines (as it is now for >>>>>> +bare metal machines). >>>>>> + >>>>> >>>>>> +Enabling support for ACPI ERST facilitates a consistent method to >>>>>> +capture kernel panic information in a wide range of guests: from >>>>>> +resource-constrained microvms to very large guests, and in >>>>>> +particular, in direct-boot environments (which would lack UEFI >>>>>> +run-time services). >>>>> this hunk probably not necessary >>>>> >>>>>> + >>>>>> +Note that Microsoft Windows also utilizes the ACPI ERST for certain >>>>>> +crash information, if available. >>>>> a pointer to a relevant source would be helpful here. >>>> >>>> I've included the reference, here for your benefit. >>>> Windows Hardware Error Architecutre, specifically Persistence Mechanism >>>> https://docs.microsoft.com/en-us/windows-hardware/drivers/whea/error-record-persistence-mechanism >>>> >>>>> >>>>>> +Invocation >>>>> s/^^/Configuration|Usage/ >>>> >>>> Corrected >>>> >>>>> >>>>>> +---------- >>>>>> + >>>>>> +To utilize ACPI ERST, a memory-backend-file object and acpi-erst >>>>> s/utilize/use/ >>>> >>>> Corrected >>>> >>>>> >>>>>> +device must be created, for example: >>>>> s/must/can/ >>>> >>>> Corrected >>>> >>>>> >>>>>> + >>>>>> + qemu ... >>>>>> + -object memory-backend-file,id=erstnvram,mem-path=acpi-erst.backing, >>>>>> + size=0x10000,share=on >>>>> I'd put ^^^ on the same line as -object and use '\' at the end the >>>>> so example could be easily copy-pasted >>>> >>>> Corrected >>>> >>>>> >>>>>> + -device acpi-erst,memdev=erstnvram >>>>>> + >>>>>> +For proper operation, the ACPI ERST device needs a memory-backend-file >>>>>> +object with the following parameters: >>>>>> + >>>>>> + - id: The id of the memory-backend-file object is used to associate >>>>>> + this memory with the acpi-erst device. >>>>>> + - size: The size of the ACPI ERST backing storage. This parameter is >>>>>> + required. >>>>>> + - mem-path: The location of the ACPI ERST backing storage file. This >>>>>> + parameter is also required. >>>>>> + - share: The share=on parameter is required so that updates to the >>>>>> + ERST back store are written to the file immediately as well. Without >>>>>> + it, updates the the backing file are unpredictable and may not >>>>>> + properly persist (eg. if qemu should crash). >>>>> >>>>> mmap manpage says: >>>>> MAP_SHARED >>>>> Updates to the mapping ... are carried through to the underlying file. >>>>> it doesn't guarantee 'written to the file immediately', though. >>>>> So I'd rephrase it to something like that: >>>>> >>>>> - share: The share=on parameter is required so that updates to the ERST back store >>>>> are written back to the file. >>>> >>>> Corrected >>>> >>>>> >>>>>> + >>>>>> +The ACPI ERST device is a simple PCI device, and requires this one >>>>>> +parameter: >>>>> s/^.*:/and ERST device:/ >>>> >>>> Corrected >>>> >>>>> >>>>>> + >>>>>> + - memdev: Is the object id of the memory-backend-file. >>>>>> + >>>>>> + >>>>>> +PCI Interface >>>>>> +------------- >>>>>> + >>>>>> +The ERST device is a PCI device with two BARs, one for accessing >>>>>> +the programming registers, and the other for accessing the >>>>>> +record exchange buffer. >>>>>> + >>>>>> +BAR0 contains the programming interface consisting of just two >>>>>> +64-bit registers. The two registers are an ACTION (cmd) and a >>>>>> +VALUE (data). All ERST actions/operations/side effects happen >>>>> s/consisting of... All ERST/consisting of ACTION and VALUE 64-bit registers. All ERST/ >>>> >>>> Corrected >>>> >>>>> >>>>>> +on the write to the ACTION, by design. Thus any data needed >>>>> s/Thus// >>>> Corrected >>>> >>>>> >>>>>> +by the action must be placed into VALUE prior to writing >>>>>> +ACTION. Reading the VALUE simply returns the register contents, >>>>>> +which can be updated by a previous ACTION. >>>>> >>>>>> This behavior is >>>>>> +encoded in the ACPI ERST table generated by QEMU. >>>>> it's too vague, Either drop sentence or add a reference to relevant place in spec. >>>> Corrected >>>> >>>>> >>>>> >>>>>> + >>>>>> +BAR1 contains the record exchange buffer, and the size of this >>>>>> +buffer sets the maximum record size. This record exchange >>>>>> +buffer size is 8KiB. >>>>> s/^^^/ >>>>> BAR1 contains the 8KiB record exchange buffer, which is the implemented maximum record size limit. >>>> Corrected >>>> >>>>> >>>>> >>>>>> +Backing File >>>>> >>>>> s/^^^/Backing Storage Format/ >>>> Corrected >>>> >>>>> >>>>>> +------------ >>>>> >>>>> >>>>>> + >>>>>> +The ACPI ERST persistent storage is contained within a single backing >>>>>> +file. The size and location of the backing file is specified upon >>>>>> +QEMU startup of the ACPI ERST device. >>>>> >>>>> I'd drop above paragraph and describe file format here, >>>>> ultimately used backend doesn't have to be a file. For >>>>> example if user doesn't need it persist over QEMU restarts, >>>>> ram backend could be used, guest will still be able to see >>>>> it's own crash log after guest is reboot, or it could be >>>>> memfd backend passed to QEMU by mgmt layer. >>>> Dropped >>>> >>>>> >>>>> >>>>>> +Records are stored in the backing file in a simple fashion. >>>>> s/backing file/backend storage/ >>>>> ditto for other occurrences >>>> Corrected >>>> >>>>> >>>>>> +The backing file is essentially divided into fixed size >>>>>> +"slots", ERST_RECORD_SIZE in length, with each "slot" >>>>>> +storing a single record. >>>>> >>>>>> No attempt at optimizing storage >>>>>> +through compression, compaction, etc is attempted. >>>>> s/^^^// >>>> >>>> I'd like to keep this statement. It is there because in a number of >>>> hardware BIOS I tested, these kinds of features lead to bugs in the >>>> ERST support. >>> this doc it's not about issues in other BIOSes, it's about conrete >>> QEMU impl. So sentence starting with "No attempt" is not relevant here at all. >> Dropped >> >>> >>>>>> +NOTE that any change to this value will make any pre- >>>>>> +existing backing files, not of the same ERST_RECORD_SIZE, >>>>>> +unusable to the guest. >>>>> when that can happen, can we detect it and error out? >>>> I've dropped this statement. That value is hard coded, and not a >>>> parameter, so there is no simple way to change it. This comment >>>> does exist next to the ERST_RECORD_SIZE declaration in the code. >>> >>> It's not a problem with current impl. but rather with possible >>> future expansion. >>> >>> If you'd add a header with record size at the start of storage, >>> it wouldn't be issue as ERST would be able to get used record >>> size for storage. That will help with avoiding compat issues >>> later on. >> I'll go ahead and add the header. I'll put the magic and record size in there, >> but I do not intend to put any data that would be "cached" from the records >> themselves. So no recordids, in particular, will be cached in this header. > maybe also add offset of the 1st slot, so however comes later > to fix performance issues will have less compatibility issues. Done > >> >> I'm not even sure I want to record/cache the number of records because: >> - it has almost no use (undermined by the fact overall storage size is not exposed, imho) >> - we backed off requiring the share=on, so it is conceivable this header value could >> encounter data integrity issues, should a guest crash... > guest crash won't affect data, and if backend is not shared then, > data won't be persistently stored anyways, they will live only for > lifetime of QEMU instance. > The only time where integrity is affected is host crash and we already > agreed that we don't care about this case. See further below > >> - scans still happen (see next) >> >> While having it (number of records cached in header) would avoid a startup scan >> to compute it, the write operation requires a scan to determine if the incoming >> recordid is present or not, in order to determine overwrite or allocate-and-write. > if present/non present per slot status is cached, you don't have to do > expensive full scan when guest scans slots. Done > >> And typically the first operation that Linux does is effectively a scan to >> populate the /sys/fs/pstore entries via the GET_RECORD_IDENTIFIER action. >> >> And the typical size of the ERST storage [on hardware systems] is 64 to 128KiB; >> so not much storage to examine, especially since only looking at 12 bytes of each >> 8KiB record. >> >> I'd still be in favor of putting an upper bound on the hostmem object, to avoid >> your worst case fears... > > Considering device is not present by default, I give up on > trying to convince you to design it efficiently. > > If one would wish to use this with container like workloads > where fast startup matters, one would have to live with crappy > performance or rewrite your impl. I've embraced your assurance of no data integrity issues, and have changed the implementation to include a header that also tracks/caches the record_ids. This eliminates all scanning of the entire backend storage. My original goal was to offer ERST as BIOS do, so backend storage size of about 64 to 128KiB; where the current implementation would be just fine. But I did mention that we were looking to do more with ERST, and the backend storage can be quite large, so you are right to push for better implementation. > >>>>>> +Below is an example layout of the backing store file. >>>>>> +The size of the file is a multiple of ERST_RECORD_SIZE, >>>>>> +and contains N number of "slots" to store records. The >>>>>> +example below shows two records (in CPER format) in the >>>>>> +backing file, while the remaining slots are empty/ >>>>>> +available. >>>>>> + >>>>>> + Slot Record >>>>>> + +--------------------------------------------+ >>>>>> + 0 | empty/available | >>>>>> + +--------------------------------------------+ >>>>>> + 1 | CPER | >>>>>> + +--------------------------------------------+ >>>>>> + 2 | CPER | >>>>>> + +--------------------------------------------+ >>>>>> + ... | | >>>>>> + +--------------------------------------------+ >>>>>> + N | empty/available | >>>>>> + +--------------------------------------------+ >>>>>> + <-------------- ERST_RECORD_SIZE ------------> >>>>> >>>>> >>>>>> +Not all slots need to be occupied, and they need not be >>>>>> +occupied in a contiguous fashion. The ability to clear/erase >>>>>> +specific records allows for the formation of unoccupied >>>>>> +slots. >>>>> I'd drop this as not necessary >>>> >>>> I'd like to keep this statement. Again, several BIOS on which I tested >>>> ERST had bugs around non-contiguous record storage. >>> >>> I'd drop this and alter sentence above ending with " in a simple fashion." >>> to describe sparse usage of storage and then after that comes example diagram. >> Done >> >>> >>> I'd like this part to start with unambiguous concise description of >>> format and to be finished with example diagram. >>> It is the part that will be considered as the the only true source >>> how file should be formatted, when it comes to fixing bugs or >>> modifying original impl. later on. So it's important to have clear >>> description without any unnecessary information here. >> Done >> >>> >>> >>>>> >>>>> >>>>>> + >>>>>> + >>>>>> +References >>>>>> +---------- >>>>>> + >>>>>> +[1] "Advanced Configuration and Power Interface Specification", >>>>>> + version 4.0, June 2009. >>>>>> + >>>>>> +[2] "Unified Extensible Firmware Interface Specification", >>>>>> + version 2.1, October 2008. >>>>>> + >>>>>> -- >>>>>> 1.8.3.1 >>>>>> >>>>> >>>> >>> >> >
diff --git a/docs/specs/acpi_erst.txt b/docs/specs/acpi_erst.txt new file mode 100644 index 0000000..79f8eb9 --- /dev/null +++ b/docs/specs/acpi_erst.txt @@ -0,0 +1,152 @@ +ACPI ERST DEVICE +================ + +The ACPI ERST device is utilized to support the ACPI Error Record +Serialization Table, ERST, functionality. The functionality is +designed for storing error records in persistent storage for +future reference/debugging. + +The ACPI specification[1], in Chapter "ACPI Platform Error Interfaces +(APEI)", and specifically subsection "Error Serialization", outlines +a method for storing error records into persistent storage. + +The format of error records is described in the UEFI specification[2], +in Appendix N "Common Platform Error Record". + +While the ACPI specification allows for an NVRAM "mode" (see +GET_ERROR_LOG_ADDRESS_RANGE_ATTRIBUTES) where non-volatile RAM is +directly exposed for direct access by the OS/guest, this implements +the non-NVRAM "mode". This non-NVRAM "mode" is what is implemented +by most BIOS (since flash memory requires programming operations +in order to update its contents). Furthermore, as of the time of this +writing, Linux does not support the non-NVRAM "mode". + + +Background/Motivation +--------------------- +Linux uses the persistent storage filesystem, pstore, to record +information (eg. dmesg tail) upon panics and shutdowns. Pstore is +independent of, and runs before, kdump. In certain scenarios (ie. +hosts/guests with root filesystems on NFS/iSCSI where networking +software and/or hardware fails), pstore may contain the only +information available for post-mortem debugging. + +Two common storage backends for the pstore filesystem are ACPI ERST +and UEFI. Most BIOS implement ACPI ERST. UEFI is not utilized in +all guests. With QEMU supporting ACPI ERST, it becomes a viable +pstore storage backend for virtual machines (as it is now for +bare metal machines). + +Enabling support for ACPI ERST facilitates a consistent method to +capture kernel panic information in a wide range of guests: from +resource-constrained microvms to very large guests, and in +particular, in direct-boot environments (which would lack UEFI +run-time services). + +Note that Microsoft Windows also utilizes the ACPI ERST for certain +crash information, if available. + + +Invocation +---------- + +To utilize ACPI ERST, a memory-backend-file object and acpi-erst +device must be created, for example: + + qemu ... + -object memory-backend-file,id=erstnvram,mem-path=acpi-erst.backing, + size=0x10000,share=on + -device acpi-erst,memdev=erstnvram + +For proper operation, the ACPI ERST device needs a memory-backend-file +object with the following parameters: + + - id: The id of the memory-backend-file object is used to associate + this memory with the acpi-erst device. + - size: The size of the ACPI ERST backing storage. This parameter is + required. + - mem-path: The location of the ACPI ERST backing storage file. This + parameter is also required. + - share: The share=on parameter is required so that updates to the + ERST back store are written to the file immediately as well. Without + it, updates the the backing file are unpredictable and may not + properly persist (eg. if qemu should crash). + +The ACPI ERST device is a simple PCI device, and requires this one +parameter: + + - memdev: Is the object id of the memory-backend-file. + + +PCI Interface +------------- + +The ERST device is a PCI device with two BARs, one for accessing +the programming registers, and the other for accessing the +record exchange buffer. + +BAR0 contains the programming interface consisting of just two +64-bit registers. The two registers are an ACTION (cmd) and a +VALUE (data). All ERST actions/operations/side effects happen +on the write to the ACTION, by design. Thus any data needed +by the action must be placed into VALUE prior to writing +ACTION. Reading the VALUE simply returns the register contents, +which can be updated by a previous ACTION. This behavior is +encoded in the ACPI ERST table generated by QEMU. + +BAR1 contains the record exchange buffer, and the size of this +buffer sets the maximum record size. This record exchange +buffer size is 8KiB. + +Backing File +------------ + +The ACPI ERST persistent storage is contained within a single backing +file. The size and location of the backing file is specified upon +QEMU startup of the ACPI ERST device. + +Records are stored in the backing file in a simple fashion. +The backing file is essentially divided into fixed size +"slots", ERST_RECORD_SIZE in length, with each "slot" +storing a single record. No attempt at optimizing storage +through compression, compaction, etc is attempted. +NOTE that any change to this value will make any pre- +existing backing files, not of the same ERST_RECORD_SIZE, +unusable to the guest. + +Below is an example layout of the backing store file. +The size of the file is a multiple of ERST_RECORD_SIZE, +and contains N number of "slots" to store records. The +example below shows two records (in CPER format) in the +backing file, while the remaining slots are empty/ +available. + + Slot Record + +--------------------------------------------+ + 0 | empty/available | + +--------------------------------------------+ + 1 | CPER | + +--------------------------------------------+ + 2 | CPER | + +--------------------------------------------+ + ... | | + +--------------------------------------------+ + N | empty/available | + +--------------------------------------------+ + <-------------- ERST_RECORD_SIZE ------------> + +Not all slots need to be occupied, and they need not be +occupied in a contiguous fashion. The ability to clear/erase +specific records allows for the formation of unoccupied +slots. + + +References +---------- + +[1] "Advanced Configuration and Power Interface Specification", + version 4.0, June 2009. + +[2] "Unified Extensible Firmware Interface Specification", + version 2.1, October 2008. +
Information on the implementation of the ACPI ERST support. Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> --- docs/specs/acpi_erst.txt | 152 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 152 insertions(+) create mode 100644 docs/specs/acpi_erst.txt