tcg: refactor pool data for simplicity and comprehension

Message ID	20250215021120.1647083-1-michael@anarch128.org (mailing list archive)
State	New
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> found: mail.anarch128.org); x-return-mx=pass smtp.domain=anarch128.org policy.is_org=yes (MX Records found: mail.anarch128.org) From: Michael Clark <michael@anarch128.org> To: qemu-devel@nongnu.org, Richard Henderson <richard.henderson@linaro.org>, Paolo Bonzini <pbonzini@redhat.com> Cc: Michael Clark <michael@anarch128.org> Subject: [PATCH] tcg: refactor pool data for simplicity and comprehension Date: Sat, 15 Feb 2025 15:11:20 +1300 Message-ID: <20250215021120.1647083-1-michael@anarch128.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=2001:4801:7825:104:be76:4eff:fe10:52ae; envelope-from=michael@anarch128.org; helo=anarch128.org X-Spam_score_int: -16 X-Spam_score: -1.7 X-Spam_bar: - X-Spam_report: (-1.7 / 5.0 requ) BAYES_00=-1.9, DKIM_INVALID=0.1, DKIM_SIGNED=0.1, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=no autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Series	tcg: refactor pool data for simplicity and comprehension \| expand tcg: refactor pool data for simplicity and comprehension

Michael Clark Feb. 15, 2025, 2:11 a.m. UTC

the intent of this patch is more conventional nomenclature
but the constant pool data code is also simplified a little.

- merge new_pool_{alloc,insert} -> new_pool_data.
- rename TCGLabelPoolData -> TCGData.
- rename pool_labels -> pool_data.
- rename macro TCG_TARGET_NEED_POOL_DATA.
- move TCGData struct definition into tcg.h.
- comment translation block epilogue members.

TCGLabelPoolData is ambiguous and asks for potential confusion
with the unrelated TCGLabel type. there is no label in the sense
of TCGLabel. the structure contains data to be emitted at the end
of translation blocks at data_gen_ptr in tcg_out_pool_finalize.
rtype and label save emitting a seperate TCGRelocation which
would be a more conventional but slightly more costly approach.

the label member is merely a pointer to the instruction text to
be updated with the relative address of the constant, the primary
data is the constant data pool at the end of translation blocks.
this relates more closely to .data sections in offline codegen
if we were to imagine a translation block has .text and .data.

thus TCGData is more succinct and more reflective of what the
structure contains; data emitted in the constant data pool at
the end of translation blocks. also, pool_labels is renamed to
pool_data as the primary contents of the list is constant data.

finally, new_pool_alloc and new_pool_insert are merged into a
single function named new_pool_data, which moves nlongs to the
end of the parameter list with varargs to allocate, copy, and
insert constant data items to simplify new_pool_label et al.
a successive step would be to collapse callers into calling
new_pool_data and remove a layer of indirection.

Signed-off-by: Michael Clark <michael@anarch128.org>
---
 include/tcg/tcg.h    | 16 ++++++++--
 tcg/tcg.c            | 72 ++++++++++++++------------------------------
 tcg/tci/tcg-target.h |  2 +-
 3 files changed, 37 insertions(+), 53 deletions(-)

Richard Henderson Feb. 15, 2025, 5:58 p.m. UTC | #1

On 2/14/25 18:11, Michael Clark wrote:
> the intent of this patch is more conventional nomenclature
> but the constant pool data code is also simplified a little.
> 
> - merge new_pool_{alloc,insert} -> new_pool_data.
> - rename TCGLabelPoolData -> TCGData.
> - rename pool_labels -> pool_data.
> - rename macro TCG_TARGET_NEED_POOL_DATA.
> - move TCGData struct definition into tcg.h.
> - comment translation block epilogue members.

You can see from this list that this should be multiple patches.

> TCGLabelPoolData is ambiguous and asks for potential confusion
> with the unrelated TCGLabel type. there is no label in the sense
> of TCGLabel.

Fair.

> the label member is merely a pointer to the instruction text to
> be updated with the relative address of the constant, the primary
> data is the constant data pool at the end of translation blocks.
> this relates more closely to .data sections in offline codegen
> if we were to imagine a translation block has .text and .data.

No, it doesn't.  It relates most closely to data emitted within .text, accessed via 
pc-relative instructions with limited offsets.

This isn't a thing you'd have ever seen on x86 or x86_64, but it is quite common for arm32 
(12-bit offsets), sh4 (8-bit offsets), m68k (16-bit offsets) and such.  Because the 
offsets are so small, they could even be placed *within* functions not just between them.

> thus TCGData is more succinct and more reflective of what the
> structure contains; data emitted in the constant data pool at
> the end of translation blocks. also, pool_labels is renamed to
> pool_data as the primary contents of the list is constant data.

I guess.  TCGData is perhaps too short, but we can certainly avoid the confusion of "labels".

> 
> finally, new_pool_alloc and new_pool_insert are merged into a
> single function named new_pool_data, which moves nlongs to the
> end of the parameter list with varargs to allocate, copy, and
> insert constant data items to simplify new_pool_label et al.
> a successive step would be to collapse callers into calling
> new_pool_data and remove a layer of indirection.

Why?  varargs generally produces horrible code.
The split between alloc and insert was intentional to avoid this.

> diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
> index a9ca493d20f6..448c2330ef0f 100644
> --- a/tcg/tci/tcg-target.h
> +++ b/tcg/tci/tcg-target.h
> @@ -72,6 +72,6 @@ typedef enum {
>   } TCGReg;
>   
>   #define HAVE_TCG_QEMU_TB_EXEC
> -#define TCG_TARGET_NEED_POOL_LABELS
> +#define TCG_TARGET_NEED_POOL_DATA

Oops, this should have been removed with a417ef83.


r~

Michael Clark Feb. 15, 2025, 8:24 p.m. UTC | #2

On 2/16/25 06:58, Richard Henderson wrote:
> On 2/14/25 18:11, Michael Clark wrote:
>> the intent of this patch is more conventional nomenclature
>> but the constant pool data code is also simplified a little.
>>
>> - merge new_pool_{alloc,insert} -> new_pool_data.
>> - rename TCGLabelPoolData -> TCGData.
>> - rename pool_labels -> pool_data.
>> - rename macro TCG_TARGET_NEED_POOL_DATA.
>> - move TCGData struct definition into tcg.h.
>> - comment translation block epilogue members.
> 
> You can see from this list that this should be multiple patches.

possibly. possibly not. I don't know how much value it adds to split it 
up as one can see a logical change together in the message and diff. 
"label" to "data" essentially. splitting it splits the logical name 
change across commits and imho would be harder as a reader of history.

splitting the name change and the merging of the two functions seems a 
little complex to coordinate as I would need some intermediate names. 
they seem to be part of the same change. if I changed it to varargs 
without naming consistent with my thinking it makes less sense to me.

>> TCGLabelPoolData is ambiguous and asks for potential confusion
>> with the unrelated TCGLabel type. there is no label in the sense
>> of TCGLabel.
> 
> Fair.
> 
>> the label member is merely a pointer to the instruction text to
>> be updated with the relative address of the constant, the primary
>> data is the constant data pool at the end of translation blocks.
>> this relates more closely to .data sections in offline codegen
>> if we were to imagine a translation block has .text and .data.
> 
> No, it doesn't.  It relates most closely to data emitted within .text, 
> accessed via pc-relative instructions with limited offsets.
> 
> This isn't a thing you'd have ever seen on x86 or x86_64, but it is 
> quite common for arm32 (12-bit offsets), sh4 (8-bit offsets), m68k (16- 
> bit offsets) and such.  Because the offsets are so small, they could 
> even be placed *within* functions not just between them.

yes I am aware of what it is. sometimes referred to as constant islands. 
but you could also consider it as switching between .text and .data or 
maybe interleaved .sdata or "small data". and yes one needs also to be 
careful about pads. one could try to put host instruction text into them 
and use them as a piece of an emulator break out if you can find out how 
to modify the stack pointer to jump to these "constants" in RX pages.

I can revise the text. I see it as a reified text constant as the 
primary data with a label (pointer not object) and a relocation embedded 
inside the constant data object for performance. in a more conventional 
code generator one would create a TCGLabel and TCGReloc. and yes you are 
right. it's not really .data section. I am just curious about 
reconciling concepts between binary translators and more traditional 
compilers. but ideally without a loss in performance.

I actually have a VM in mind that has a constant stream with it's own 
counter that branches called IB (immediate base). IB is set in call 
procedure and we pack a vector into the link register with the relative 
offset of the program counter and immediate base register (i32,i32) 
gives call +/-2GiB reach. link register no longer has absolute address. 
and there is a branch instruction for the constant stream. return needs 
two immediate offsets displaced from the text and constant entry points 
to compute the return address.

I have a sketch for a design that does this. and constants don't come 
over the same memory port as data as it has dedicated fetch bandwidth 
with a constant prefetcher. I could build a front-end and back-end for 
it when I get time. a CPU with some GPU concepts but no fixed function 
rasterizer. also relocs are easier because a RISC using this gets more 
conventional aligned PC-relative relocs of whole words. so the linker 
would be a little faster. constants are all bonded register slot sized 
in the instruction form instead of a special immediate form. it needs 
extra bypassing for "large immediate" but the latency can be covered by 
adding pipeline stages. constant branch predictor in parallel up front.

>> thus TCGData is more succinct and more reflective of what the
>> structure contains; data emitted in the constant data pool at
>> the end of translation blocks. also, pool_labels is renamed to
>> pool_data as the primary contents of the list is constant data.
> 
> I guess.  TCGData is perhaps too short, but we can certainly avoid the 
> confusion of "labels".

TCGPoolData would be okay and I thought about that. the reason against 
was that it competed with TCGPool. but in some senses that might be the 
right name because it is a data constant in the pool. but we embed the 
relocation and label (as pointer) without a full TCGLabel and TCGReloc.

>> finally, new_pool_alloc and new_pool_insert are merged into a
>> single function named new_pool_data, which moves nlongs to the
>> end of the parameter list with varargs to allocate, copy, and
>> insert constant data items to simplify new_pool_label et al.
>> a successive step would be to collapse callers into calling
>> new_pool_data and remove a layer of indirection.
> 
> Why?  varargs generally produces horrible code.
> The split between alloc and insert was intentional to avoid this.

it's pretty good code on SysV because it goes via registers except for 
perhaps new_pool_l8 which will spill to stack and get copied unless the 
inliner can eliminate the copies. maybe windows has bad varargs. but I 
like the style better than deeper layers of wrapper functions. they 
should fix the compiler so that it produces better code.

btw I didn't run through a benchmark suite to check for performance 
regressions because I don't have one in the rig. it was more of a code 
comprehension comment expressed as a patch. it could be merged if it 
adds to the code comprehension without affecting performance.

>> diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
>> index a9ca493d20f6..448c2330ef0f 100644
>> --- a/tcg/tci/tcg-target.h
>> +++ b/tcg/tci/tcg-target.h
>> @@ -72,6 +72,6 @@ typedef enum {
>>   } TCGReg;
>>   #define HAVE_TCG_QEMU_TB_EXEC
>> -#define TCG_TARGET_NEED_POOL_LABELS
>> +#define TCG_TARGET_NEED_POOL_DATA
> 
> Oops, this should have been removed with a417ef83.

I will refresh.

Richard Henderson Feb. 15, 2025, 9:50 p.m. UTC | #3

On 2/15/25 12:24, Michael Clark wrote:
>> Why?  varargs generally produces horrible code.
>> The split between alloc and insert was intentional to avoid this.
> 
> it's pretty good code on SysV because it goes via registers except for perhaps new_pool_l8 
> which will spill to stack and get copied unless the inliner can eliminate the copies. 
> maybe windows has bad varargs. but I like the style better than deeper layers of wrapper 
> functions. they should fix the compiler so that it produces better code.

varargs will generally prevent inlining.

Indeed, just a quick look at aarch64 tcg_out_movi shows that without your change, 
new_pool_label, new_pool_data and new_pool_insert are all inlined.  With your change, 
new_pool_data is not inlined, all the argument regs are dumped to the stack, etc.

I don't think this is a good change to make.

r~

Michael Clark Feb. 15, 2025, 10:40 p.m. UTC | #4

On 2/16/25 10:50, Richard Henderson wrote:
> On 2/15/25 12:24, Michael Clark wrote:
>>> Why?  varargs generally produces horrible code.
>>> The split between alloc and insert was intentional to avoid this.
>>
>> it's pretty good code on SysV because it goes via registers except for 
>> perhaps new_pool_l8 which will spill to stack and get copied unless 
>> the inliner can eliminate the copies. maybe windows has bad varargs. 
>> but I like the style better than deeper layers of wrapper functions. 
>> they should fix the compiler so that it produces better code.
> 
> varargs will generally prevent inlining.
> 
> Indeed, just a quick look at aarch64 tcg_out_movi shows that without 
> your change, new_pool_label, new_pool_data and new_pool_insert are all 
> inlined.  With your change, new_pool_data is not inlined, all the 
> argument regs are dumped to the stack, etc.
> 
> I don't think this is a good change to make.

fixing varargs codegen in GCC/Clang would be a good change. count based 
varargs can be reasoned about statically relatively easily. what is it 
like with an explicit inline as opposed to just static? I will inspect 
anyhow. as varargs with the argument count moved and long as opposed to 
format strings is a reasonable pattern for generics in C given C is so 
bad at generics. saves lots of long form delegating functions.

I like the representation but understand if you don't take the patch.

Michael Clark Feb. 15, 2025, 10:58 p.m. UTC | #5

On 2/16/25 09:24, Michael Clark wrote:
> I actually have a VM in mind that has a constant stream with it's own 
> counter that branches called IB (immediate base). IB is set in call 
> procedure and we pack a vector into the link register with the relative 
> offset of the program counter and immediate base register (i32,i32) 
> gives call ±2GiB reach. link register no longer has absolute address. 
> and there is a branch instruction for the constant stream. return needs 
> two immediate offsets displaced from the text and constant entry points 
> to compute the return address.

btw here this is the instruction packet I have in mind for a CPU with 
immediate blocks. it has much simpler length decoding than RISC-V and 
less instruction forms too. two regslots can be bonded for a larger 
indirect immediate slot using a constant from the current constant block 
in the constant stream relative to PC/IB. 64-bit instructions can have 6 
inputs/outputs which I have packed for vector in a sketch and it works. 
it doesn't have 48-bit instructions because we simplified length 
decoding compared to RISC-V. and decoding can be more easily vectorized 
than RISC-V but we lost 1 bit in the 16-bit packet due to that. and I 
have a python model that synthesizes logic for parallel decoders

https://anarch128.org/~mclark/VLI.pdf

I'm filling out the 16-bit opcode space. I need to add a constant with a 
(i32,i32) relative immediate to ret to adjust for relative link register 
so the return also needs a constant from the constant block aswell. it's 
symmetrical but I don't have a software model yet.

https://gist.github.com/michaeljclark/8f9b81e5e40488035dc252c9da3ecc2e

in two to three years I may have a POC translator. I'm still working on 
codegen for a new X86 backend. for VLI, 16-wide or 8 x 32-bit packets 
(256-bits) is plausible with vastly simpler decode logic than X86. 
that's like a 16/32/64-bit RISC front-end. at the moment I am thinking 
of allowing 16-bit alignment for larger instruction words but I haven't 
tried to synthesize a decoder yet. but I think much simpler than RISC-V 
because 2 wires per packet for length vs up-to 7.

Richard Henderson Feb. 15, 2025, 11:41 p.m. UTC | #6

On 2/15/25 14:40, Michael Clark wrote:
> On 2/16/25 10:50, Richard Henderson wrote:
>> On 2/15/25 12:24, Michael Clark wrote:
>>>> Why?  varargs generally produces horrible code.
>>>> The split between alloc and insert was intentional to avoid this.
>>>
>>> it's pretty good code on SysV because it goes via registers except for perhaps 
>>> new_pool_l8 which will spill to stack and get copied unless the inliner can eliminate 
>>> the copies. maybe windows has bad varargs. but I like the style better than deeper 
>>> layers of wrapper functions. they should fix the compiler so that it produces better code.
>>
>> varargs will generally prevent inlining.
>>
>> Indeed, just a quick look at aarch64 tcg_out_movi shows that without your change, 
>> new_pool_label, new_pool_data and new_pool_insert are all inlined.  With your change, 
>> new_pool_data is not inlined, all the argument regs are dumped to the stack, etc.
>>
>> I don't think this is a good change to make.
> 
> fixing varargs codegen in GCC/Clang would be a good change. count based varargs can be 
> reasoned about statically relatively easily. what is it like with an explicit inline as 
> opposed to just static?

Inline will still be rejected.


r~

Michael Clark Feb. 16, 2025, 12:48 a.m. UTC | #7

On 2/16/25 12:41, Richard Henderson wrote:
>>> I don't think this is a good change to make.
>>
>> fixing varargs codegen in GCC/Clang would be a good change. count 
>> based varargs can be reasoned about statically relatively easily. what 
>> is it like with an explicit inline as opposed to just static?
> 
> Inline will still be rejected.

oh wow. fair enough. I didn't know varargs was so broken. Clang even 
appears to be emitting weird extern functions with inline varargs that 
change when another invocation is added. taking away inline fixes it. 
this is unusual and I think this is a Clang compiler bug. flip the #if. 
code seems to be conforming or is it a bug? might post to Clang folks.

https://godbolt.org/z/WsEPfdfbE

Michael Clark Feb. 16, 2025, 8 a.m. UTC | #8

On 2/16/25 06:58, Richard Henderson wrote:
> 
>> the label member is merely a pointer to the instruction text to
>> be updated with the relative address of the constant, the primary
>> data is the constant data pool at the end of translation blocks.
>> this relates more closely to .data sections in offline codegen
>> if we were to imagine a translation block has .text and .data.
> 
> No, it doesn't.  It relates most closely to data emitted within .text, 
> accessed via pc-relative instructions with limited offsets.
> 
> This isn't a thing you'd have ever seen on x86 or x86_64, but it is 
> quite common for arm32 (12-bit offsets), sh4 (8-bit offsets), m68k (16- 
> bit offsets) and such.  Because the offsets are so small, they could 
> even be placed *within* functions not just between them.

I mentioned before I like the idea and have thought about architectures 
with constant streams and constant branch units.

say for arguments sake we considered it 'TCData' with embedded label and 
reloc (the purpose is the constant after after all, just it is not a 
TCGTemp, it's an explicitly reified constant in the codegen emitters). 
wondering if we could add a "disposition" field to control placement. 
TCG_DISP_TEXT_TB, TCG_DISP_DATA, etc. this way you could ask the code 
generator to do something more conventional while still supporting the 
short relative constant islands. "disposition" might be better than 
section as a name. also a DATA section could be mmap R without X perms 
to lessen the risk of injecting code as constants.

disposition; "the way in which something is placed or arranged, 
especially in relation to other things."

TCGConstant is another alternative I would consider as okay. distinct 
from TCGTemp of type TEMP_CONST which is heavier weight. it makes one 
wonder about reification of large implicit constants as opposed to the 
explicitly emitted ones we are talking about here.

TCGConstant might be better.

i'm looking at a TCG source-compatible code generator as an option so I 
may experiment locally. it is a private interface at the moment anyhow. 
that just seemed inconsistent as most structure definitions are in the 
header. but I understand it is a private interface.

the patch was code comprehension. i'm studying TCG interfaces and code 
at the moment. i'd like to stay source compatible as much as possible.

Michael.

Richard Henderson Feb. 16, 2025, 6:01 p.m. UTC | #9

On 2/16/25 00:00, Michael Clark wrote:
> On 2/16/25 06:58, Richard Henderson wrote:
>>
>>> the label member is merely a pointer to the instruction text to
>>> be updated with the relative address of the constant, the primary
>>> data is the constant data pool at the end of translation blocks.
>>> this relates more closely to .data sections in offline codegen
>>> if we were to imagine a translation block has .text and .data.
>>
>> No, it doesn't.  It relates most closely to data emitted within .text, accessed via pc- 
>> relative instructions with limited offsets.
>>
>> This isn't a thing you'd have ever seen on x86 or x86_64, but it is quite common for 
>> arm32 (12-bit offsets), sh4 (8-bit offsets), m68k (16- bit offsets) and such.  Because 
>> the offsets are so small, they could even be placed *within* functions not just between 
>> them.
> 
> I mentioned before I like the idea and have thought about architectures with constant 
> streams and constant branch units.
> 
> say for arguments sake we considered it 'TCData' with embedded label and reloc (the 
> purpose is the constant after after all, just it is not a TCGTemp, it's an explicitly 
> reified constant in the codegen emitters). wondering if we could add a "disposition" field 
> to control placement. TCG_DISP_TEXT_TB, TCG_DISP_DATA, etc. this way you could ask the 
> code generator to do something more conventional while still supporting the short relative 
> constant islands. "disposition" might be better than section as a name. also a DATA 
> section could be mmap R without X perms to lessen the risk of injecting code as constants.

I don't think there's any point to doing anything differently than we currently do: place 
the data at the end of the TB.

(1) The architectures that we host and use the constant pool currently have
     relatively large displacements: aarch64 (21 bit), x86_64 (32 bit),
     ppc (16 or 34 bit (power10 only)), riscv (32 bit), s390x (34 bit).

(2) The size of a TB pretty generally maxes out at 3-4k, but is firmly capped at 64k
     by uint16_t TranslationBlock.jmp_reset_offset.

(3) The 16 and 21-bit offsets are not large enough to stretch to a read-only mapping.

(4) Memory management of TranslationBlocks becomes *much* more complicated.

> TCGConstant is another alternative I would consider as okay. distinct from TCGTemp of type 
> TEMP_CONST which is heavier weight. it makes one wonder about reification of large 
> implicit constants as opposed to the explicitly emitted ones we are talking about here.

TCGConstant isn't bad, but I think I prefer TCGPoolData as mooted before.

> i'm looking at a TCG source-compatible code generator as an option so I may experiment 
> locally. it is a private interface at the moment anyhow. that just seemed inconsistent as 
> most structure definitions are in the header. but I understand it is a private interface.

The organization of tcg.h is from antiquity.  I am actively trying to reduce the size of 
the exported API.


r~

tcg: refactor pool data for simplicity and comprehension

Commit Message

Comments

Patch