Message ID | 20220920174603.302510-5-aaronlewis@google.com (mailing list archive)
State      | New, archived
Series     | Introduce and test masked events
On Tue, Sep 20, 2022, Aaron Lewis wrote: > +To access individual components in a masked entry use: > +:: > + struct kvm_pmu_event_masked_entry { > + union { > + __u64 raw; > + struct { > + /* event_select bits 11:8 are only valid on AMD. */ > + __u64 event_select:12; > + __u64 mask:8; > + __u64 match:8; > + __u64 exclude:1; As suggested in patch 3, keep the architectural bits where they are and fill in the gaps. IIUC, all of bits 63:36 are available, i.e. there's lots of room for expansion. Go top down (start at 63) and cross our fingers that neither Intel nor AMD puts stuff there. If that does happen, then we can start mucking with the layout, but until then, let's not make it too hard for ourselves. > + __u64 rsvd:35; > #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */ > #define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */ > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c > index 7ce8bfafea91..b188ddb23f75 100644 > --- a/arch/x86/kvm/pmu.c > +++ b/arch/x86/kvm/pmu.c > @@ -252,34 +252,61 @@ static inline u8 get_unit_mask(u64 eventsel) > return (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8; > } ... > +/* > + * Sort will order the list by exclude, then event select. This function will > + * then index the sublists of event selects such that when a search is done on > + * the list, the head of the event select sublist is returned. This simplifies > + * the logic in filter_contains_match() when walking the list. Unless I'm missing something, this is a complex way to solve a relatively simple problem. You want a list of "includes" and a list of "excludes". Just have two separate lists. Actually, if we're effectively changing the ABI, why not make everyone's lives simpler and expose that to userspace. E.g. use one of the "pad" words to specify the number of "include" events and effectively do this: struct kvm_pmu_event_filter { __u32 action; __u32 nevents; __u32 fixed_counter_bitmap; __u32 flags; __u32 nr_include_events; __u64 include[nr_include_events]; __u64 exclude[nevents - nr_allowed_events]; }; Then we don't need to steal a bit for "exclude" in the uABI. The kernel code gets a wee bit simpler. > @@ -693,6 +796,10 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp) > /* Ensure nevents can't be changed between the user copies. */ > *filter = tmp; > > + r = -EINVAL; > + if (!is_filter_valid(filter)) Do this in prepare_filter_events() to avoid walking the filter multiple times. I've gotten fairly twisted around and my local copy of the code is a mess, but something like this: static int prepare_filter_events(struct kvm_pmu_event_filter *filter) { int i; for (i = 0, j = 0; i < filter->nevents; i++) { if (filter->events[i] & ~kvm_pmu_ops.EVENTSEL_MASK) { if (!filter->mask) continue; if (<reserved bits set>) return -EINVAL; } filter->events[j++] = filter->events[i]; } <more magic here> } > + goto cleanup; > + > prepare_filter_events(filter); > > mutex_lock(&kvm->lock);
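For readers trying to picture the split layout Sean sketches above (the include[...]/exclude[...] declarations are shorthand, not literal C), here is one minimal way it could be realized: a single trailing events[] array with the includes first, plus nr_include_events carried in one of the existing pad words. The field ordering and the split_filter_lists() helper are invented for illustration only, they are not part of the series.

struct kvm_pmu_event_filter {
	__u32 action;
	__u32 nevents;
	__u32 fixed_counter_bitmap;
	__u32 flags;
	__u32 nr_include_events;
	__u32 pad;
	__u64 events[];		/* nr_include_events includes, then excludes */
};

/* Kernel-side view: derive the two lists from the single trailing array. */
static void split_filter_lists(struct kvm_pmu_event_filter *filter,
			       __u64 **includes, __u32 *nr_includes,
			       __u64 **excludes, __u32 *nr_excludes)
{
	*includes = filter->events;
	*nr_includes = filter->nr_include_events;
	*excludes = filter->events + filter->nr_include_events;
	*nr_excludes = filter->nevents - filter->nr_include_events;
}

This is only one reading of the sketch; Aaron asks below whether a separate include pointer was intended instead.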
On Tue, Sep 20, 2022, Aaron Lewis wrote: > static void convert_to_filter_events(struct kvm_pmu_event_filter *filter) > @@ -645,7 +719,34 @@ static void convert_to_filter_events(struct kvm_pmu_event_filter *filter) > for (i = 0; i < filter->nevents; i++) { > u64 e = filter->events[i]; > > - filter->events[i] = encode_filter_entry(e); > + filter->events[i] = encode_filter_entry(e, filter->flags); > + } > +} > + > +/* > + * Sort will order the list by exclude, then event select. This function will > + * then index the sublists of event selects such that when a search is done on > + * the list, the head of the event select sublist is returned. This simplifies > + * the logic in filter_contains_match() when walking the list. I'm not so sure that this is simpler overall though. If inclusive vs. exclusive are separate lists, then avoiding "index" would mean there's no need to convert entries. And IIUC, the only thing this saves in filter_contains_match() is having to walk backwards, e.g. it's for (i = index; i < filter->nevents; i++) { if (!<eventsel event match>) break; if (is_filter_match(...)) return true; } return false; versus for (i = index; i < filter->nevents; i++) { if (filter_event_cmp(eventsel, filter->events[i])) break; if (is_filter_match(eventsel, filter->events[i])) return true; } for (i = index - 1; i > 0; i--) { if (filter_event_cmp(eventsel, filter->events[i])) break; if (is_filter_match(eventsel, filter->events[i])) return true; } return false; It's definitely _more_ code in filter_contains_match(), and the duplicate code is unfortunate, but I wouldn't necessarily say it's simpler. There's a fair bit of complexity in understanding the indexing scheme, it's just hidden. And I believe if the indexing is dropped, then the same filter_event_cmp() helper can be used for sort() and bsearch(), and for bounding the walks. And here's also no need to "encode" entries or use a second struct overly. We might need separate logic for the existing non-masked mechanism, but again that's only more code, IMO it's not more complex. E.g. I believe that the legacy case can be handled with a dedicated: if (!(filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS)) return find_filter_index(..., cmp_u64) > 0; Oh, and as a bonus of splitting include vs. exclude, the legacy case effectively optimizes exclude since the length of the exclude array will be '0'. If we do keep the indexing, I think we should rename "index" to "head", e.g. like "head pages", to make it more obvious that the helper returns the head of a list. > + */ > +static void index_filter_events(struct kvm_pmu_event_filter *filter) > +{ > + struct kvm_pmu_filter_entry *prev, *curr; > + int i, index = 0; > + > + if (filter->nevents) > + prev = (struct kvm_pmu_filter_entry *)(filter->events); > + > + for (i = 0; i < filter->nevents; i++) { > + curr = (struct kvm_pmu_filter_entry *)(&filter->events[i]); > + > + if (curr->event_select != prev->event_select || > + curr->exclude != prev->exclude) { > + index = 0; > + prev = curr; > + } > + > + curr->event_index = index++; > } > } > + * When filter events are converted into this format then sorted, the > + * resulting list naturally ends up in two sublists. One for the 'include > + * list' and one for the 'exclude list'. These sublists are further broken > + * down into sublists ordered by their event select. 
After that, the > + * event select sublists are indexed such that a search for: exclude = n, > + * event_select = n, and event_index = 0 will return the head of an event > + * select sublist that can be walked to see if a match exists. > + */ > struct kvm_pmu_filter_entry { > union { > u64 raw; > struct { > + u64 mask:8; > + u64 match:8; > + u64 event_index:12; This is broken. There are 2^12 possible event_select values, but event_index is the index into the full list of events, i.e. is bounded only by nevents, and so this needs to be stored as a 32-bit value. E.g. if userspace creates a filter with 2^32-2 entries for eventsel==0, then the index for eventsel==1 will be 2^32-1 even though there are only two event_select values in the entire list. > u64 event_select:12; > - u64 unit_mask:8; > - u64 rsvd:44; > + u64 exclude:1; > + u64 rsvd:23; > }; > }; > };
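To make the indexing scheme being debated here concrete, a small worked example of what index_filter_events() produces on a sorted list (the event select values are arbitrary, and this scheme is dropped later in the thread):

/*
 * Sorted by exclude then event select, with event_index assigned by
 * index_filter_events():
 *
 *   exclude=0, event_select=0x0C0, event_index=0   <- head of sublist
 *   exclude=0, event_select=0x0C0, event_index=1
 *   exclude=0, event_select=0x0D1, event_index=0   <- head of sublist
 *   exclude=1, event_select=0x0C0, event_index=0   <- index resets when
 *                                                     exclude or select changes
 *
 * A lookup keyed on (exclude, event_select, event_index=0) therefore lands
 * on the head of the matching sublist, which filter_contains_match() can
 * then walk forward.
 */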
On Mon, Oct 10, 2022 at 8:42 AM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Sep 20, 2022, Aaron Lewis wrote: > > static void convert_to_filter_events(struct kvm_pmu_event_filter *filter) > > @@ -645,7 +719,34 @@ static void convert_to_filter_events(struct kvm_pmu_event_filter *filter) > > for (i = 0; i < filter->nevents; i++) { > > u64 e = filter->events[i]; > > > > - filter->events[i] = encode_filter_entry(e); > > + filter->events[i] = encode_filter_entry(e, filter->flags); > > + } > > +} > > + > > +/* > > + * Sort will order the list by exclude, then event select. This function will > > + * then index the sublists of event selects such that when a search is done on > > + * the list, the head of the event select sublist is returned. This simplifies > > + * the logic in filter_contains_match() when walking the list. > Including previous comments so I can respond in one place: > Unless I'm missing something, this is a complex way to solve a relatively simple > problem. You want a list of "includes" and a list of "excludes". Just have two > separate lists. > > Actually, if we're effectively changing the ABI, why not make everyone's lives > simpler and expose that to userspace. E.g. use one of the "pad" words to specify > the number of "include" events and effectively do this: > > struct kvm_pmu_event_filter { > __u32 action; > __u32 nevents; > __u32 fixed_counter_bitmap; > __u32 flags; > __u32 nr_include_events; > __u64 include[nr_include_events]; > __u64 exclude[nevents - nr_allowed_events]; > }; > > Then we don't need to steal a bit for "exclude" in the uABI. The kernel code > gets a wee bit simpler. <end of previous comments> I'm not sure I understand the struct changes you're proposing. Is this what you mean? struct kvm_pmu_event_filter { __u32 action; __u32 nevents; __u32 fixed_counter_bitmap; __u32 flags; __u32 nr_include_events; __u32 pad; __u64 *include; // length == nr_include_events __u64 exclude[]; // length == nevents - nr_include_events }; I considered having an include list and exclude list on the filter, but I thought that was too much detail to share with userspace. I've spent too much time lately trying to change ABI's or wishing they were different. I'd prefer to share as little as possible with userspace at this point in hopes of allowing us to change our minds or evolve this in the future. To that end, if we use a bit in the event to distinguish between an include list and an exclude list the only thing we're locked into is the type of event being introduced with masked events, which would be easy to iterate on or change. Then, if we sort the list like I do, it's naturally split into an include list and an exclude list anyway, so we essentially have the same thing in the end. At that point if we want to explicitly have an include list and an exclude list we could make an internal struct for kvm_pmu_event_filter similar to what msr filtering does, but I'm not sure that's necessary. I think setting it up this way makes the data coming from userspace less error prone and easier for them to work with because the lists are being separated for them into include and exclude, and the order they build it in doesn't matter. Isn't having the include list pointer exposed to userspace a little awkward? As is I'm not sure why userspace would care about it, and when we load the filter we'd essentially ignore it. It feels like an internal variable that's exposed to userspace. Also, the naming is confusing when working with non-masked events. E.g. 
the "exclude" list would be used in place of what was the "events" list. That seems less intuitive. > I'm not so sure that this is simpler overall though. If inclusive vs. exclusive > are separate lists, then avoiding "index" would mean there's no need to convert > entries. And IIUC, the only thing this saves in filter_contains_match() is > having to walk backwards, e.g. it's > > for (i = index; i < filter->nevents; i++) { > if (!<eventsel event match>) > break; > > if (is_filter_match(...)) > return true; > } > > return false; > > versus > > for (i = index; i < filter->nevents; i++) { > if (filter_event_cmp(eventsel, filter->events[i])) > break; > > if (is_filter_match(eventsel, filter->events[i])) > return true; > } > > for (i = index - 1; i > 0; i--) { > if (filter_event_cmp(eventsel, filter->events[i])) > break; > > if (is_filter_match(eventsel, filter->events[i])) > return true; > } > > return false; > > It's definitely _more_ code in filter_contains_match(), and the duplicate code is > unfortunate, but I wouldn't necessarily say it's simpler. There's a fair bit of > complexity in understanding the indexing scheme, it's just hidden. > I think indexing the data is nice to have. The only additional cost is walking the list when a filter is loaded, then we are able to search the head of the list and not have to do this odd walk forward then backward business. I'm open to dropping it if it's causing more confusion than it's worth. And as you pointed out, I believe not having it would allow for the use of filter_event_cmp() during bounding walks. > And I believe if the indexing is dropped, then the same filter_event_cmp() helper > can be used for sort() and bsearch(), and for bounding the walks. > > And here's also no need to "encode" entries or use a second struct overly. > Encoding events solves a different problem than indexing. It converts the events that come from userspace into a common format we can use internally. That allows the kernel to have a common code path, no matter what type of event it started out as. > We might need separate logic for the existing non-masked mechanism, but again > that's only more code, IMO it's not more complex. E.g. I believe that the legacy > case can be handled with a dedicated: > > if (!(filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS)) > return find_filter_index(..., cmp_u64) > 0; > I'd rather not drop encoding, however, if we did, sorting and searching would both have to be special cased. Masked events and non-masked events are not compatible with each other that way. If the list is sorted for masked events, you can't do a cmp_u64 search on it and expect to find anything. I suppose you could if using the struct overlay and we included match (which doubles as the unit mask for non-masked events) in the comparer and dropped indexing. But that only works because the struct overlay keeps the event select contiguous. The fact that the architectural format has the event select straddle the unit mask prevents that from working. > Oh, and as a bonus of splitting include vs. exclude, the legacy case effectively > optimizes exclude since the length of the exclude array will be '0'. > This should apply to both implementations. I didn't add an include list and exclude list, but the data is laid out for it. The main reason I didn't do that is I didn't think there was enough of a win to create an internal copy of kvm_pmu_event_filter (stated above), and it'd be exposing too much of the implementation to userspace to add it there. 
> If we do keep the indexing, I think we should rename "index" to "head", e.g. like > "head pages", to make it more obvious that the helper returns the head of a list. > > > + */ > > +static void index_filter_events(struct kvm_pmu_event_filter *filter) > > +{ > > + struct kvm_pmu_filter_entry *prev, *curr; > > + int i, index = 0; > > + > > + if (filter->nevents) > > + prev = (struct kvm_pmu_filter_entry *)(filter->events); > > + > > + for (i = 0; i < filter->nevents; i++) { > > + curr = (struct kvm_pmu_filter_entry *)(&filter->events[i]); > > + > > + if (curr->event_select != prev->event_select || > > + curr->exclude != prev->exclude) { > > + index = 0; > > + prev = curr; > > + } > > + > > + curr->event_index = index++; > > } > > } > > + * When filter events are converted into this format then sorted, the > > + * resulting list naturally ends up in two sublists. One for the 'include > > + * list' and one for the 'exclude list'. These sublists are further broken > > + * down into sublists ordered by their event select. After that, the > > + * event select sublists are indexed such that a search for: exclude = n, > > + * event_select = n, and event_index = 0 will return the head of an event > > + * select sublist that can be walked to see if a match exists. > > + */ > > struct kvm_pmu_filter_entry { > > union { > > u64 raw; > > struct { > > + u64 mask:8; > > + u64 match:8; > > + u64 event_index:12; > > This is broken. There are 2^12 possible event_select values, but event_index is > the index into the full list of events, i.e. is bounded only by nevents, and so > this needs to be stored as a 32-bit value. E.g. if userspace creates a filter > with 2^32-2 entries for eventsel==0, then the index for eventsel==1 will be > 2^32-1 even though there are only two event_select values in the entire list. > nevents <= KVM_PMU_EVENT_FILTER_MAX_EVENTS (300), so there is plenty of space in event_index to cover any valid index. Though a moot point if we cut it. > > u64 event_select:12; > > - u64 unit_mask:8; > > - u64 rsvd:44; > > + u64 exclude:1; > > + u64 rsvd:23; > > }; > > }; > > };
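As a concrete illustration of the "convert userspace events to a common internal format" argument above, here is a sketch of what the legacy (non-masked) half of encode_filter_entry() amounts to, using the kvm_pmu_filter_entry overlay from this version of the series. The helper name and exact packing are illustrative, and the masked-event half is omitted.

static u64 encode_legacy_filter_entry(u64 eventsel)
{
	struct kvm_pmu_filter_entry e = { .raw = 0 };

	/* The architectural event select is split across bits 35:32 and 7:0. */
	e.event_select = ((eventsel >> 24) & 0xF00ull) | (eventsel & 0xFFull);
	/* A legacy entry exact-matches the unit mask... */
	e.match = (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8;
	e.mask = 0xFF;
	/* ...and is always an "include" entry. */
	e.exclude = 0;

	return e.raw;
}

With every entry in this form, sorting, searching, and matching only ever deal with one layout, which is the point Aaron is making.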
On Sat, Oct 15, 2022, Aaron Lewis wrote: > On Mon, Oct 10, 2022 at 8:42 AM Sean Christopherson <seanjc@google.com> wrote: > > Unless I'm missing something, this is a complex way to solve a relatively simple > > problem. You want a list of "includes" and a list of "excludes". Just have two > > separate lists. > > > > Actually, if we're effectively changing the ABI, why not make everyone's lives > > simpler and expose that to userspace. E.g. use one of the "pad" words to specify > > the number of "include" events and effectively do this: > > > > struct kvm_pmu_event_filter { > > __u32 action; > > __u32 nevents; > > __u32 fixed_counter_bitmap; > > __u32 flags; > > __u32 nr_include_events; > > __u64 include[nr_include_events]; > > __u64 exclude[nevents - nr_allowed_events]; > > }; > > > > Then we don't need to steal a bit for "exclude" in the uABI. The kernel code > > gets a wee bit simpler. > > <end of previous comments> > > I'm not sure I understand the struct changes you're proposing. Is > this what you mean? > > struct kvm_pmu_event_filter { > __u32 action; > __u32 nevents; > __u32 fixed_counter_bitmap; > __u32 flags; > __u32 nr_include_events; > __u32 pad; > __u64 *include; // length == nr_include_events > __u64 exclude[]; // length == nevents - nr_include_events Ya, something like that. > }; > > I considered having an include list and exclude list on the filter, > but I thought that was too much detail to share with userspace. I've > spent too much time lately trying to change ABI's or wishing they were > different. I'd prefer to share as little as possible with userspace at > this point in hopes of allowing us to change our minds or evolve this > in the future. I don't think a list vs. a flag shares any more or less with userspace, e.g. KVM could very well merge the lists internally. > To that end, if we use a bit in the event to distinguish between an > include list and an exclude list the only thing we're locked into is > the type of event being introduced with masked events, which would be > easy to iterate on or change. As above, we're not locked into an internal implementation either way. > Then, if we sort the list like I do, it's naturally split into an include > list and an exclude list anyway, so we essentially have the same thing in the > end. At that point if we want to explicitly have an include list and an > exclude list we could make an internal struct for kvm_pmu_event_filter > similar to what msr filtering does, but I'm not sure that's necessary. > > I think setting it up this way makes the data coming from userspace > less error prone and easier for them to work with because the lists > are being separated for them into include and exclude, and the order > they build it in doesn't matter. The above said, I agree that building the separate lists would be obnoxious for userspace. Userspace would either need to know the length of the "includes" list up front, or would have to build the "excludes" list separately and then fold it in at the end. Unless we might run out of bits, lets keep the flag approach for the uAPI. > Isn't having the include list pointer exposed to userspace a little > awkward? As is I'm not sure why userspace would care about it, and > when we load the filter we'd essentially ignore it. I don't follow. I assume userspace cares about specifying "includes", otherwise why provide that functionality? > It feels like an internal variable that's exposed to userspace. Also, the > naming is confusing when working with non-masked events. E.g. 
the "exclude" > list would be used in place of what was the "events" list. That seems less > intuitive. Honestly, I find the "includes vs. excludes" terminology to be quite confusing regardless of what API is presented to userspace. Can't think of better names though. > > I'm not so sure that this is simpler overall though. If inclusive vs. exclusive > > are separate lists, then avoiding "index" would mean there's no need to convert > > entries. And IIUC, the only thing this saves in filter_contains_match() is > > having to walk backwards, e.g. it's > > > > for (i = index; i < filter->nevents; i++) { > > if (!<eventsel event match>) > > break; > > > > if (is_filter_match(...)) > > return true; > > } > > > > return false; > > > > versus > > > > for (i = index; i < filter->nevents; i++) { > > if (filter_event_cmp(eventsel, filter->events[i])) > > break; > > > > if (is_filter_match(eventsel, filter->events[i])) > > return true; > > } > > > > for (i = index - 1; i > 0; i--) { > > if (filter_event_cmp(eventsel, filter->events[i])) > > break; > > > > if (is_filter_match(eventsel, filter->events[i])) > > return true; > > } > > > > return false; > > > > It's definitely _more_ code in filter_contains_match(), and the duplicate code is > > unfortunate, but I wouldn't necessarily say it's simpler. There's a fair bit of > > complexity in understanding the indexing scheme, it's just hidden. > > > > I think indexing the data is nice to have. The only additional cost > is walking the list when a filter is loaded, then we are able to > search the head of the list and not have to do this odd walk forward > then backward business. I'm not concerned about the setup cost to create the sub-lists, my thoughts are purely from a "how easy is it to understand" perspective, closely followed by "how performant is the runtime code". IIUC, the runtime performance is basically equivalent since the comparison function will need to mask off bits either way. For the forward+backward, IMO it's easy to understand with a single comment above the forward+backward walks: /* * Entries are sorted by eventsel, walk the list in both directions to * process all entries with the target eventsel. */ Sorting should probably have a comment too, but critically, understanding how the list is sorted doesn't require full understanding of how lookups are processed. E.g. a comment like this would suffice for sorting: /* * Sort entries by eventsel so that all entries for a given eventsel can * be processed effeciently during filter. */ The sub-list on the other hand requires comments to explain the sub-list concept, document the indexing code (and its comparator?), the #define for the head bits (and/or struct field), and the walk code code that processes the sublist. In addition to having more things to document, the comments will be spread out, and IMO it's difficult to understand each individual piece without having a good grasp of the overall implementation. Using "head" instead of "index" would help quite a bit for the lookup+walk, but I think the extra pieces will still leave readers wondering "why?. E.g. Why is there a separate comparator for sorting vs. lookup? Why is there a "head" field? What problem do sub-lists solve? > I'm open to dropping it if it's causing more confusion than it's worth. And > as you pointed out, I believe not having it would allow for the use of > filter_event_cmp() during bounding walks. 
> > > And I believe if the indexing is dropped, then the same filter_event_cmp() helper > > can be used for sort() and bsearch(), and for bounding the walks. > > > > And here's also no need to "encode" entries or use a second struct overly. > > > > Encoding events solves a different problem than indexing. It converts > the events that come from userspace into a common format we can use > internally. That allows the kernel to have a common code path, no > matter what type of event it started out as. I don't see why encoding is necessary to achieve a common internal imlementation though. To make sure we're talking about the same thing, by "encoding" I mean having different layouts for the same data in the uAPI struct than the in-kernel struct. I'm perfectly ok having bits in the uAPI struct that are dropped when building the in-kernel lists, e.g. the EXCLUDE bit. Ditto for having bits in the in-kernel struct that don't exist in the uAPI struct, e.g. the "head" (index) of the sublist if we go that route. My objection to "encoding" is moving the eventsel bits around. AFAICT, it's not strictly necessary, and similar to the sub-list approach, that raises that question of "why?". In all cases, aren't the "eventsel" bits always contained in bits 35:32 and 7:0? The comparator can simply mask off all other bits in order to find a filter with the specified eventsel, no? > > We might need separate logic for the existing non-masked mechanism, but again > > that's only more code, IMO it's not more complex. E.g. I believe that the legacy > > case can be handled with a dedicated: > > > > if (!(filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS)) > > return find_filter_index(..., cmp_u64) > 0; > > > > I'd rather not drop encoding, however, if we did, sorting and > searching would both have to be special cased. Masked events and > non-masked events are not compatible with each other that way. If the > list is sorted for masked events, you can't do a cmp_u64 search on it > and expect to find anything. I'm not sure I follow. As above, doesn't the search for masked vs. non-masked only differ on the comparator? > I suppose you could if using the struct overlay and we included match (which > doubles as the unit mask for non-masked events) in the comparer and dropped > indexing. Why does "match" need to be in the comparer? I thought the sub-lists are created for each "eventsel"? > But that only works because the struct overlay keeps the event select > contiguous. The fact that the architectural format has the event select > straddle the unit mask prevents that from working. > > > Oh, and as a bonus of splitting include vs. exclude, the legacy case effectively > > optimizes exclude since the length of the exclude array will be '0'. > > This should apply to both implementations. If everything is in a single list, doesn't the exclude lookup need to search the entire thing to determine that there are no entries? > I didn't add an include list and exclude list, but the data is laid out for > it. The main reason I didn't do that is I didn't think there was enough of a > win to create an internal copy of kvm_pmu_event_filter (stated above), and > it'd be exposing too much of the implementation to userspace to add it there. > > > If we do keep the indexing, I think we should rename "index" to "head", e.g. like > > "head pages", to make it more obvious that the helper returns the head of a list. ... 
> > > struct kvm_pmu_filter_entry {
> > >	union {
> > >		u64 raw;
> > >		struct {
> > > +			u64 mask:8;
> > > +			u64 match:8;
> > > +			u64 event_index:12;
> >
> > This is broken. There are 2^12 possible event_select values, but event_index is
> > the index into the full list of events, i.e. is bounded only by nevents, and so
> > this needs to be stored as a 32-bit value. E.g. if userspace creates a filter
> > with 2^32-2 entries for eventsel==0, then the index for eventsel==1 will be
> > 2^32-1 even though there are only two event_select values in the entire list.
> >
>
> nevents <= KVM_PMU_EVENT_FILTER_MAX_EVENTS (300), so there is plenty
> of space in event_index to cover any valid index. Though a moot point
> if we cut it.

Oooh. In that case, won't 9 bits suffice? Using 12 implies there's a connection
to eventsel, which is misleading and confusing.

Regardless, if we keep the indexing, there should be a BUILD_BUG_ON() to assert
that the "head" (event_index) field is large enough to hold the max number of
events. Not sure if there's a way to get the number of bits, i.e. might require
a separate #define to get the compile-time assert. :-/
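If the indexing were kept, the compile-time assert Sean asks for could look roughly like the sketch below, with the field shrunk to 9 bits as he suggests. KVM_PMU_FILTER_EVENT_INDEX_BITS is an invented name, and the whole thing is moot once the indexing is dropped later in the thread.

#define KVM_PMU_FILTER_EVENT_INDEX_BITS	9

struct kvm_pmu_filter_entry {
	union {
		u64 raw;
		struct {
			u64 mask:8;
			u64 match:8;
			u64 event_index:KVM_PMU_FILTER_EVENT_INDEX_BITS;
			u64 event_select:12;
			u64 exclude:1;
			u64 rsvd:26;
		};
	};
};

/* The head/index field must be able to hold any valid filter index. */
static_assert(KVM_PMU_EVENT_FILTER_MAX_EVENTS <=
	      (1u << KVM_PMU_FILTER_EVENT_INDEX_BITS) - 1);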
> > I think setting it up this way makes the data coming from userspace > > less error prone and easier for them to work with because the lists > > are being separated for them into include and exclude, and the order > > they build it in doesn't matter. > > The above said, I agree that building the separate lists would be obnoxious for > userspace. Userspace would either need to know the length of the "includes" list > up front, or would have to build the "excludes" list separately and then fold it > in at the end. Unless we might run out of bits, lets keep the flag approach for > the uAPI. > > > Isn't having the include list pointer exposed to userspace a little > > awkward? As is I'm not sure why userspace would care about it, and > > when we load the filter we'd essentially ignore it. > > I don't follow. I assume userspace cares about specifying "includes", otherwise > why provide that functionality? Yes, userspace cares about includes, but I'm not sure they will care about the "include" pointer because we would ignore it when loading a filter (there's no information it provides we can get from the other fields) and it would be cumbersome for them to use as you pointed out. So, I was just guessing that they wouldn't use it. > > I'm not concerned about the setup cost to create the sub-lists, my thoughts are > purely from a "how easy is it to understand" perspective, closely followed by > "how performant is the runtime code". IIUC, the runtime performance is basically > equivalent since the comparison function will need to mask off bits either way. > > For the forward+backward, IMO it's easy to understand with a single comment above > the forward+backward walks: > > /* > * Entries are sorted by eventsel, walk the list in both directions to > * process all entries with the target eventsel. > */ > > Sorting should probably have a comment too, but critically, understanding how the > list is sorted doesn't require full understanding of how lookups are processed. > E.g. a comment like this would suffice for sorting: > > /* > * Sort entries by eventsel so that all entries for a given eventsel can > * be processed efficiently during filter. > */ > > The sub-list on the other hand requires comments to explain the sub-list concept, > document the indexing code (and its comparator?), the #define for the head bits > (and/or struct field), and the walk code code that processes the sublist. > > In addition to having more things to document, the comments will be spread out, > and IMO it's difficult to understand each individual piece without having a good > grasp of the overall implementation. Using "head" instead of "index" would help > quite a bit for the lookup+walk, but I think the extra pieces will still leave > readers wondering "why?. E.g. Why is there a separate comparator for sorting > vs. lookup? Why is there a "head" field? What problem do sub-lists solve? > I'll drop indexing. > > I don't see why encoding is necessary to achieve a common internal implementation > though. To make sure we're talking about the same thing, by "encoding" I mean > having different layouts for the same data in the uAPI struct than the in-kernel > struct. I'm perfectly ok having bits in the uAPI struct that are dropped when > building the in-kernel lists, e.g. the EXCLUDE bit. Ditto for having bits in the > in-kernel struct that don't exist in the uAPI struct, e.g. the "head" (index) of > the sublist if we go that route. > > My objection to "encoding" is moving the eventsel bits around. 
AFAICT, it's not > strictly necessary, and similar to the sub-list approach, that raises that question > of "why?". In all cases, aren't the "eventsel" bits always contained in bits 35:32 > and 7:0? The comparator can simply mask off all other bits in order to find a > filter with the specified eventsel, no? Strictly speaking, yes... as long as you are aware of some boundaries and work within them. Problem: -------- Given the event select + unit mask pairs: {0x101, 0x1}, {0x101, 0x2}, {0x102, 0x1} If we use the architectural layout we get: 0x10011, 0x10021, 0x10012. Sorted by filter_event_cmp(), i.e. event select, it's possible to get: 0x10012, 0x10011, 0x10021. Then if a search for {0x101, 0x2} with cmp_u64() is done it wouldn't find anything because 0x10021 is deemed less than 0x10011 in this list. That's why adding find_filter_index(..., cmp_u64) > 0 doesn't work to special case legacy events. Similarly, if sorted by cmp_u64() we get: 0x10021, 0x10012, 0x10011. Then a search with filter_event_cmp() for 0x10021 would fail as well. Possible workaround #1: ----------------------- If we use the internal layout (removed index). struct kvm_pmu_filter_entry { union { u64 raw; struct { u64 mask:8; u64 match:8; // doubles as umask, i.e. encode_filter_entry() u64 event_select:12; u64 exclude:1; u64 rsvd:35; }; }; }; Because the event_select is contiguous, all of its bits are above match, and the fields in kvm_pmu_filter_entry are ordered from most important to least important from a sort perspective, a sort with cmp_u64() gives us: 0x10210, 0x10120, 0x10110. This order will work for both legacy events and masked events. So, as you suggested, if we wanted to add the following to is_gp_event_allowed() we can special case legacy events and early out: if (!(filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS)) return find_filter_index(..., cmp_u64) > 0; Or we can leave it as is and the mask events path will come up with the same result. Possible workaround #2: ----------------------- If we use the architectural layout for masked events, e.g.: struct kvm_pmu_event_filter_entry { union { __u64 raw; struct { __u64 select_lo:8; __u64 match:8; __u64 rsvd1:16; __u64 select_hi:4; __u64 rsvd2:19; __u64 mask:8; __u64 exclude:1; }; } } This becomes more restrictive, but for our use case I believe it will work. The problem with this layout in general is where the event select ends up. You can't put anything below 'select_lo', and 'select_lo' and 'select_hi' straddle the unit mask, so you just have less control over the order things end up. That said, if our goal is to just sort by exclude + event select for masked events we can do that with filter_event_cmp() on the layout above. If we want to find a legacy event we can use cmp_u64() on the layout above as well. The only caveat is you can't mix and match during lookup like you can in workaround #1. You have to sort and search with the same comparer. If we convert legacy events to masked events I'm pretty sure we can still have a common code path. > > > > We might need separate logic for the existing non-masked mechanism, but again > > > that's only more code, IMO it's not more complex. E.g. I believe that the legacy > > > case can be handled with a dedicated: > > > > > > if (!(filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS)) > > > return find_filter_index(..., cmp_u64) > 0; > > > > > > > I'd rather not drop encoding, however, if we did, sorting and > > searching would both have to be special cased. 
Masked events and > > non-masked events are not compatible with each other that way. If the > > list is sorted for masked events, you can't do a cmp_u64 search on it > > and expect to find anything. > > I'm not sure I follow. As above, doesn't the search for masked vs. non-masked only > differ on the comparator? > > > I suppose you could if using the struct overlay and we included match (which > > doubles as the unit mask for non-masked events) in the comparer and dropped > > indexing. > > Why does "match" need to be in the comparer? I thought the sub-lists are created > for each "eventsel"? > > > But that only works because the struct overlay keeps the event select > > contiguous. The fact that the architectural format has the event select > > straddle the unit mask prevents that from working. > > > > > Oh, and as a bonus of splitting include vs. exclude, the legacy case effectively > > > optimizes exclude since the length of the exclude array will be '0'. > > > > This should apply to both implementations. > > If everything is in a single list, doesn't the exclude lookup need to search the > entire thing to determine that there are no entries? Yes. To do what I'm suggesting would require creating an internal representation of the struct kvm_pmu_event_filter to track the include and exclude portions of the list. But that should be fairly straightforward to set up when loading the filter, then we would be able to search the individual lists rather than the whole thing. Having said that, I'm not saying this is necessary, all I'm saying is that it wouldn't be hard to do. > ... > > > > > struct kvm_pmu_filter_entry { > > > > union { > > > > u64 raw; > > > > struct { > > > > + u64 mask:8; > > > > + u64 match:8; > > > > + u64 event_index:12; > > > > > > This is broken. There are 2^12 possible event_select values, but event_index is > > > the index into the full list of events, i.e. is bounded only by nevents, and so > > > this needs to be stored as a 32-bit value. E.g. if userspace creates a filter > > > with 2^32-2 entries for eventsel==0, then the index for eventsel==1 will be > > > 2^32-1 even though there are only two event_select values in the entire list. > > > > > > > nevents <= KVM_PMU_EVENT_FILTER_MAX_EVENTS (300), so there is plenty > > of space in event_index to cover any valid index. Though a moot point > > if we cut it. > > Oooh. In that case, won't 9 bits suffice? Using 12 implies there's a connection > to eventsel, which is misleading and confusing. Yes, 9 would suffice. I gave some buffer because I liked the idea of round numbers and I had a lot of space in the struct. I didn't see it implying a connection to event_select, but in hindsight I can see why you may have thought that. > > Regardless, if we keep the indexing, there should be a BUILD_BUG_ON() to assert > that the "head" (event_index) field is larger enough to hold the max number of > events. Not sure if there's a way to get the number of bits, i.e. might require > a separate #define to get the compile-time assert. :-/
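A quick standalone demonstration (plain userspace C, not kernel code) of the ordering problem described in the quoted "Problem" section above, using the same three events written out with the real architectural bit positions (event select in bits 35:32 and 7:0, unit mask in bits 15:8):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static int cmp_u64(const void *pa, const void *pb)
{
	uint64_t a = *(const uint64_t *)pa;
	uint64_t b = *(const uint64_t *)pb;

	return (a > b) - (a < b);
}

int main(void)
{
	/*
	 * {0x101, 0x1}, {0x101, 0x2}, {0x102, 0x1} in the architectural
	 * layout.  This order is legal when sorting by event select only,
	 * because the first two entries compare as equal.
	 */
	uint64_t events[] = { 0x100000201ull, 0x100000101ull, 0x100000102ull };
	uint64_t key = 0x100000201ull;	/* {0x101, 0x2} */

	/* Prints "found: no": the list is not sorted by full u64 value. */
	printf("found: %s\n",
	       bsearch(&key, events, 3, sizeof(key), cmp_u64) ? "yes" : "no");
	return 0;
}

This is why the sort comparator and the lookup comparator cannot be mixed and matched on the architectural layout, which is what both workarounds above are dancing around.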
On Wed, Oct 19, 2022 at 1:06 AM Aaron Lewis <aaronlewis@google.com> wrote: > > > > I think setting it up this way makes the data coming from userspace > > > less error prone and easier for them to work with because the lists > > > are being separated for them into include and exclude, and the order > > > they build it in doesn't matter. > > > > The above said, I agree that building the separate lists would be obnoxious for > > userspace. Userspace would either need to know the length of the "includes" list > > up front, or would have to build the "excludes" list separately and then fold it > > in at the end. Unless we might run out of bits, lets keep the flag approach for > > the uAPI. > > > > > Isn't having the include list pointer exposed to userspace a little > > > awkward? As is I'm not sure why userspace would care about it, and > > > when we load the filter we'd essentially ignore it. > > > > I don't follow. I assume userspace cares about specifying "includes", otherwise > > why provide that functionality? > > Yes, userspace cares about includes, but I'm not sure they will care > about the "include" pointer because we would ignore it when loading a > filter (there's no information it provides we can get from the other > fields) and it would be cumbersome for them to use as you pointed out. > So, I was just guessing that they wouldn't use it. > > > > > I'm not concerned about the setup cost to create the sub-lists, my thoughts are > > purely from a "how easy is it to understand" perspective, closely followed by > > "how performant is the runtime code". IIUC, the runtime performance is basically > > equivalent since the comparison function will need to mask off bits either way. > > > > For the forward+backward, IMO it's easy to understand with a single comment above > > the forward+backward walks: > > > > /* > > * Entries are sorted by eventsel, walk the list in both directions to > > * process all entries with the target eventsel. > > */ > > > > Sorting should probably have a comment too, but critically, understanding how the > > list is sorted doesn't require full understanding of how lookups are processed. > > E.g. a comment like this would suffice for sorting: > > > > /* > > * Sort entries by eventsel so that all entries for a given eventsel can > > * be processed efficiently during filter. > > */ > > > > The sub-list on the other hand requires comments to explain the sub-list concept, > > document the indexing code (and its comparator?), the #define for the head bits > > (and/or struct field), and the walk code code that processes the sublist. > > > > In addition to having more things to document, the comments will be spread out, > > and IMO it's difficult to understand each individual piece without having a good > > grasp of the overall implementation. Using "head" instead of "index" would help > > quite a bit for the lookup+walk, but I think the extra pieces will still leave > > readers wondering "why?. E.g. Why is there a separate comparator for sorting > > vs. lookup? Why is there a "head" field? What problem do sub-lists solve? > > > > I'll drop indexing. > > > > > I don't see why encoding is necessary to achieve a common internal implementation > > though. To make sure we're talking about the same thing, by "encoding" I mean > > having different layouts for the same data in the uAPI struct than the in-kernel > > struct. I'm perfectly ok having bits in the uAPI struct that are dropped when > > building the in-kernel lists, e.g. the EXCLUDE bit. 
Ditto for having bits in the > > in-kernel struct that don't exist in the uAPI struct, e.g. the "head" (index) of > > the sublist if we go that route. > > > > My objection to "encoding" is moving the eventsel bits around. AFAICT, it's not > > strictly necessary, and similar to the sub-list approach, that raises that question > > of "why?". In all cases, aren't the "eventsel" bits always contained in bits 35:32 > > and 7:0? The comparator can simply mask off all other bits in order to find a > > filter with the specified eventsel, no? > > Strictly speaking, yes... as long as you are aware of some boundaries > and work within them. > > Problem: > -------- > > Given the event select + unit mask pairs: {0x101, 0x1}, {0x101, 0x2}, > {0x102, 0x1} > > If we use the architectural layout we get: 0x10011, 0x10021, 0x10012. > > Sorted by filter_event_cmp(), i.e. event select, it's possible to get: > 0x10012, 0x10011, 0x10021. > > Then if a search for {0x101, 0x2} with cmp_u64() is done it wouldn't > find anything because 0x10021 is deemed less than 0x10011 in this > list. That's why adding find_filter_index(..., cmp_u64) > 0 doesn't > work to special case legacy events. > > Similarly, if sorted by cmp_u64() we get: 0x10021, 0x10012, 0x10011. > > Then a search with filter_event_cmp() for 0x10021 would fail as well. > > Possible workaround #1: > ----------------------- > > If we use the internal layout (removed index). > > struct kvm_pmu_filter_entry { > union { > u64 raw; > struct { > u64 mask:8; > u64 match:8; // doubles as umask, i.e. > encode_filter_entry() > u64 event_select:12; > u64 exclude:1; > u64 rsvd:35; > }; > }; > }; > > Because the event_select is contiguous, all of its bits are above > match, and the fields in kvm_pmu_filter_entry are ordered from most > important to least important from a sort perspective, a sort with > cmp_u64() gives us: 0x10210, 0x10120, 0x10110. > > This order will work for both legacy events and masked events. So, as > you suggested, if we wanted to add the following to > is_gp_event_allowed() we can special case legacy events and early out: > > if (!(filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS)) > return find_filter_index(..., cmp_u64) > 0; > > Or we can leave it as is and the mask events path will come up with > the same result. > > Possible workaround #2: > ----------------------- > > If we use the architectural layout for masked events, e.g.: > > struct kvm_pmu_event_filter_entry { > union { > __u64 raw; > struct { > __u64 select_lo:8; > __u64 match:8; > __u64 rsvd1:16; > __u64 select_hi:4; > __u64 rsvd2:19; > __u64 mask:8; > __u64 exclude:1; > }; > } > } > > This becomes more restrictive, but for our use case I believe it will > work. The problem with this layout in general is where the event > select ends up. You can't put anything below 'select_lo', and > 'select_lo' and 'select_hi' straddle the unit mask, so you just have > less control over the order things end up. That said, if our goal is > to just sort by exclude + event select for masked events we can do > that with filter_event_cmp() on the layout above. If we want to find > a legacy event we can use cmp_u64() on the layout above as well. The > only caveat is you can't mix and match during lookup like you can in > workaround #1. You have to sort and search with the same comparer. > > If we convert legacy events to masked events I'm pretty sure we can > still have a common code path. 
> > > > > > > We might need separate logic for the existing non-masked mechanism, but again > > > > that's only more code, IMO it's not more complex. E.g. I believe that the legacy > > > > case can be handled with a dedicated: > > > > > > > > if (!(filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS)) > > > > return find_filter_index(..., cmp_u64) > 0; > > > > > > > > > > I'd rather not drop encoding, however, if we did, sorting and > > > searching would both have to be special cased. Masked events and > > > non-masked events are not compatible with each other that way. If the > > > list is sorted for masked events, you can't do a cmp_u64 search on it > > > and expect to find anything. > > > > I'm not sure I follow. As above, doesn't the search for masked vs. non-masked only > > differ on the comparator? > > > > > I suppose you could if using the struct overlay and we included match (which > > > doubles as the unit mask for non-masked events) in the comparer and dropped > > > indexing. > > > > Why does "match" need to be in the comparer? I thought the sub-lists are created > > for each "eventsel"? > > > > > But that only works because the struct overlay keeps the event select > > > contiguous. The fact that the architectural format has the event select > > > straddle the unit mask prevents that from working. > > > > > > > Oh, and as a bonus of splitting include vs. exclude, the legacy case effectively > > > > optimizes exclude since the length of the exclude array will be '0'. > > > > > > This should apply to both implementations. > > > > If everything is in a single list, doesn't the exclude lookup need to search the > > entire thing to determine that there are no entries? > > Yes. To do what I'm suggesting would require creating an internal > representation of the struct kvm_pmu_event_filter to track the include > and exclude portions of the list. But that should be fairly > straightforward to set up when loading the filter, then we would be > able to search the individual lists rather than the whole thing. > Having said that, I'm not saying this is necessary, all I'm saying is > that it wouldn't be hard to do. > > > ... > > > > > > > struct kvm_pmu_filter_entry { > > > > > union { > > > > > u64 raw; > > > > > struct { > > > > > + u64 mask:8; > > > > > + u64 match:8; > > > > > + u64 event_index:12; > > > > > > > > This is broken. There are 2^12 possible event_select values, but event_index is > > > > the index into the full list of events, i.e. is bounded only by nevents, and so > > > > this needs to be stored as a 32-bit value. E.g. if userspace creates a filter > > > > with 2^32-2 entries for eventsel==0, then the index for eventsel==1 will be > > > > 2^32-1 even though there are only two event_select values in the entire list. > > > > > > > > > > nevents <= KVM_PMU_EVENT_FILTER_MAX_EVENTS (300), so there is plenty > > > of space in event_index to cover any valid index. Though a moot point > > > if we cut it. > > > > Oooh. In that case, won't 9 bits suffice? Using 12 implies there's a connection > > to eventsel, which is misleading and confusing. > > Yes, 9 would suffice. I gave some buffer because I liked the idea of > round numbers and I had a lot of space in the struct. I didn't see it > implying a connection to event_select, but in hindsight I can see why > you may have thought that. > > > > > Regardless, if we keep the indexing, there should be a BUILD_BUG_ON() to assert > > that the "head" (event_index) field is larger enough to hold the max number of > > events. 
Not sure if there's a way to get the number of bits, i.e. might require > > a separate #define to get the compile-time assert. :-/ Here's what I came up with. Let me know if this is what you were thinking: arch/x86/include/uapi/asm/kvm.h #define KVM_PMU_EVENT_FLAG_MASKED_EVENTS BIT(0) #define KVM_PMU_EVENT_FLAGS_VALID_MASK (KVM_PMU_EVENT_FLAG_MASKED_EVENTS) /* * Masked event layout. * Bits Description * ---- ----------- * 7:0 event select (low bits) * 15:8 umask match * 31:16 unused * 35:32 event select (high bits) * 36:54 unused * 55 exclude bit * 63:56 umask mask */ #define KVM_PMU_ENCODE_MASKED_ENTRY(event_select, mask, match, exclude) \ (((event_select) & 0xFFULL) | (((event_select) & 0XF00ULL) << 24) | \ (((mask) & 0xFFULL) << 56) | \ (((match) & 0xFFULL) << 8) | \ ((__u64)(!!(exclude)) << 55)) #define KVM_PMU_MASKED_ENTRY_EVENT_SELECT \ (GENMASK_ULL(7, 0) | GENMASK_ULL(35, 32)) #define KVM_PMU_MASKED_ENTRY_UMASK_MASK (GENMASK_ULL(63, 56)) #define KVM_PMU_MASKED_ENTRY_UMASK_MATCH (GENMASK_ULL(15, 8)) #define KVM_PMU_MASKED_ENTRY_EXCLUDE (BIT_ULL(55)) #define KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT (56) arch/x86/include/asm/kvm_host.h struct kvm_x86_pmu_event_filter { __u32 action; __u32 nevents; __u32 fixed_counter_bitmap; __u32 flags; __u32 nr_includes; __u32 nr_excludes; __u64 *includes; __u64 *excludes; __u64 events[]; }; - struct kvm_pmu_event_filter __rcu *pmu_event_filter; + struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter; arch/x86/kvm/pmu.c static int filter_sort_cmp(const void *pa, const void *pb) { u64 a = *(u64 *)pa & (KVM_PMU_MASKED_ENTRY_EVENT_SELECT | KVM_PMU_MASKED_ENTRY_EXCLUDE); u64 b = *(u64 *)pb & (KVM_PMU_MASKED_ENTRY_EVENT_SELECT | KVM_PMU_MASKED_ENTRY_EXCLUDE); return (a > b) - (a < b); } /* * For the event filter, searching is done on the 'includes' list and * 'excludes' list separately rather than on the 'events' list (which * has both). As a result the exclude bit can be ignored. */ static int filter_event_cmp(const void *pa, const void *pb) { u64 a = *(u64 *)pa & KVM_PMU_MASKED_ENTRY_EVENT_SELECT; u64 b = *(u64 *)pb & KVM_PMU_MASKED_ENTRY_EVENT_SELECT; return (a > b) - (a < b); } static int find_filter_index(u64 *events, u64 nevents, u64 key) { u64 *fe = bsearch(&key, events, nevents, sizeof(events[0]), filter_event_cmp); if (!fe) return -1; return fe - events; } static bool is_filter_entry_match(u64 filter_event, u64 umask) { u64 mask = filter_event >> (KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT - 8); u64 match = filter_event & KVM_PMU_MASKED_ENTRY_UMASK_MATCH; BUILD_BUG_ON((KVM_PMU_ENCODE_MASKED_ENTRY(0, 0xff, 0, false) >> (KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT - 8)) != ARCH_PERFMON_EVENTSEL_UMASK); return (umask & mask) == match; } static bool filter_contains_match(u64 *events, u64 nevents, u64 eventsel) { u64 event_select = eventsel & kvm_pmu_ops.EVENTSEL_EVENT; u64 umask = eventsel & ARCH_PERFMON_EVENTSEL_UMASK; int i, index; index = find_filter_index(events, nevents, event_select); if (index < 0) return false; /* * Entries are sorted by the event select. Walk the list in both * directions to process all entries with the targeted event select. 
*/ for (i = index; i < nevents; i++) { if (filter_event_cmp(&events[i], &event_select) != 0) break; if (is_filter_entry_match(events[i], umask)) return true; } for (i = index - 1; i >= 0; i--) { if (filter_event_cmp(&events[i], &event_select) != 0) break; if (is_filter_entry_match(events[i], umask)) return true; } return false; } static bool is_gp_event_allowed(struct kvm_x86_pmu_event_filter *filter, u64 eventsel) { if (filter_contains_match(filter->includes, filter->nr_includes, eventsel) && !filter_contains_match(filter->excludes, filter->nr_excludes, eventsel)) return filter->action == KVM_PMU_EVENT_ALLOW; return filter->action == KVM_PMU_EVENT_DENY; } < All the code above here is for filtering the guest eventsel. > < All the code below here is for validating and loading the filter. > static bool is_filter_valid(struct kvm_x86_pmu_event_filter *filter) { u64 mask; int i; /* To maintain backwards compatibility only validate masked events. */ if (filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS) { mask = kvm_pmu_ops.EVENTSEL_EVENT | KVM_PMU_MASKED_ENTRY_UMASK_MASK | KVM_PMU_MASKED_ENTRY_UMASK_MATCH | KVM_PMU_MASKED_ENTRY_EXCLUDE; for (i = 0; i < filter->nevents; i++) { if (filter->events[i] & ~mask) return false; } } return true; } static void prepare_filter_events(struct kvm_x86_pmu_event_filter *filter) { int i, j; if (filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS) return; for (i = 0, j = 0; i < filter->nevents; i++) { /* * Skip events that are impossible to match against a guest * event. When filtering, only the event select + unit mask * of the guest event is used. */ if (filter->events[i] & ~(kvm_pmu_ops.EVENTSEL_EVENT | ARCH_PERFMON_EVENTSEL_UMASK)) continue; /* * Convert userspace events to a common in-kernel event so * only one code path is needed to support both events. For * the in-kernel events use masked events because they are * flexible enough to handle both cases. To convert to masked * events all that's needed is to add the umask_mask. 
*/ filter->events[j++] = filter->events[i] | (0xFFULL << KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT); } filter->nevents = j; } static void setup_filter_lists(struct kvm_x86_pmu_event_filter *filter) { int i; for (i = 0; i < filter->nevents; i++) { if(filter->events[i] & KVM_PMU_MASKED_ENTRY_EXCLUDE) break; } filter->nr_includes = i; filter->nr_excludes = filter->nevents - filter->nr_includes; filter->includes = filter->events; filter->excludes = filter->events + filter->nr_includes; } int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp) { - struct kvm_pmu_event_filter tmp, *filter; + struct kvm_pmu_event_filter __user *user_filter = argp; + struct kvm_x86_pmu_event_filter *filter; + struct kvm_pmu_event_filter tmp; size_t size; int r; - if (copy_from_user(&tmp, argp, sizeof(tmp))) + if (copy_from_user(&tmp, user_filter, sizeof(tmp))) return -EFAULT; if (tmp.action != KVM_PMU_EVENT_ALLOW && tmp.action != KVM_PMU_EVENT_DENY) return -EINVAL; - if (tmp.flags != 0) + if (tmp.flags & ~KVM_PMU_EVENT_FLAGS_VALID_MASK) return -EINVAL; if (tmp.nevents > KVM_PMU_EVENT_FILTER_MAX_EVENTS) return -E2BIG; size = struct_size(filter, events, tmp.nevents); - filter = kmalloc(size, GFP_KERNEL_ACCOUNT); + filter = kzalloc(size, GFP_KERNEL_ACCOUNT); if (!filter) return -ENOMEM; + filter->action = tmp.action; + filter->nevents = tmp.nevents; + filter->fixed_counter_bitmap = tmp.fixed_counter_bitmap; + filter->flags = tmp.flags; + r = -EFAULT; - if (copy_from_user(filter, argp, size)) + if (copy_from_user(filter->events, user_filter->events, + sizeof(filter->events[0]) * filter->nevents)) goto cleanup; - /* Restore the verified state to guard against TOCTOU attacks. */ - *filter = tmp; + r = -EINVAL; + if (!is_filter_valid(filter)) + goto cleanup; - remove_impossible_events(filter); + prepare_filter_events(filter); /* - * Sort the in-kernel list so that we can search it with bsearch. + * Sort entries by event select so that all entries for a given + * event select can be processed efficiently during filtering. */ - sort(&filter->events, filter->nevents, sizeof(__u64), cmp_u64, NULL); + sort(&filter->events, filter->nevents, sizeof(filter->events[0]), + filter_sort_cmp, NULL); + + setup_filter_lists(filter);
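For context on how a VMM would consume the uAPI above, here is a minimal userspace sketch (not part of the series) that installs a one-entry masked allow list. It assumes the KVM_PMU_EVENT_FLAG_MASKED_EVENTS flag and KVM_PMU_ENCODE_MASKED_ENTRY macro proposed above; 0x3C is just an arbitrary example event select.

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_masked_filter(int vm_fd)
{
	struct kvm_pmu_event_filter *f;
	int r;

	f = calloc(1, sizeof(*f) + sizeof(__u64));
	if (!f)
		return -1;

	f->action = KVM_PMU_EVENT_ALLOW;
	f->flags = KVM_PMU_EVENT_FLAG_MASKED_EVENTS;
	f->nevents = 1;
	/* umask_mask == 0 and umask_match == 0: any unit mask for event select 0x3C. */
	f->events[0] = KVM_PMU_ENCODE_MASKED_ENTRY(0x3C, 0, 0, 0);

	r = ioctl(vm_fd, KVM_SET_PMU_EVENT_FILTER, f);
	free(f);
	return r;
}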
On Fri, Oct 21, 2022, Aaron Lewis wrote: > Here's what I came up with. Let me know if this is what you were thinking: A purely mechanical suggestions, but overall looks awesome! > static int filter_sort_cmp(const void *pa, const void *pb) > { > u64 a = *(u64 *)pa & (KVM_PMU_MASKED_ENTRY_EVENT_SELECT | > KVM_PMU_MASKED_ENTRY_EXCLUDE); > u64 b = *(u64 *)pb & (KVM_PMU_MASKED_ENTRY_EVENT_SELECT | > KVM_PMU_MASKED_ENTRY_EXCLUDE); > > return (a > b) - (a < b); > } > > /* > * For the event filter, searching is done on the 'includes' list and > * 'excludes' list separately rather than on the 'events' list (which > * has both). As a result the exclude bit can be ignored. > */ > static int filter_event_cmp(const void *pa, const void *pb) > { > u64 a = *(u64 *)pa & KVM_PMU_MASKED_ENTRY_EVENT_SELECT; > u64 b = *(u64 *)pb & KVM_PMU_MASKED_ENTRY_EVENT_SELECT; > > return (a > b) - (a < b); > } To dedup code slightly and make this a little more readable, what about adding a common helper to do the compare? That also makes it quite obvious that the only difference is the inclusion (heh) of the EXCLUDE flag. static int filter_cmp(u64 *pa, u64 *pb, u64 mask) { u64 a = *pa & mask; u64 b = *pb & mask; return (a > b) - (a < b); } static int filter_sort_cmp(const void *pa, const void *pb) { return filter_cmp(pa, pb, (KVM_PMU_MASKED_ENTRY_EVENT_SELECT | KVM_PMU_MASKED_ENTRY_EXCLUDE); } /* * For the event filter, searching is done on the 'includes' list and * 'excludes' list separately rather than on the 'events' list (which * has both). As a result the exclude bit can be ignored. */ static int filter_event_cmp(const void *pa, const void *pb) { return filter_cmp(pa, pb, (KVM_PMU_MASKED_ENTRY_EVENT_SELECT); } > static bool filter_contains_match(u64 *events, u64 nevents, u64 eventsel) > { > u64 event_select = eventsel & kvm_pmu_ops.EVENTSEL_EVENT; > u64 umask = eventsel & ARCH_PERFMON_EVENTSEL_UMASK; > int i, index; > > index = find_filter_index(events, nevents, event_select); > if (index < 0) > return false; > > /* > * Entries are sorted by the event select. Walk the list in both > * directions to process all entries with the targeted event select. > */ > for (i = index; i < nevents; i++) { > if (filter_event_cmp(&events[i], &event_select) != 0) Preferred kernel style is to omit comparisons against zero, i.e. just if (filter_event_cmp(&events[i], &event_select)) break; > break; > > if (is_filter_entry_match(events[i], umask)) > return true; > } > > for (i = index - 1; i >= 0; i--) { > if (filter_event_cmp(&events[i], &event_select) != 0) > break; > > if (is_filter_entry_match(events[i], umask)) > return true; > } > > return false; > } > > static bool is_gp_event_allowed(struct kvm_x86_pmu_event_filter *filter, > u64 eventsel) > { > if (filter_contains_match(filter->includes, > filter->nr_includes, eventsel) && > !filter_contains_match(filter->excludes, > filter->nr_excludes, eventsel)) > return filter->action == KVM_PMU_EVENT_ALLOW; > > return filter->action == KVM_PMU_EVENT_DENY; Might be worth using a single char for the filter param, e.g. 
'f' yields:

static bool is_gp_event_allowed(struct kvm_x86_pmu_event_filter *f, u64 eventsel)
{
	if (filter_contains_match(f->includes, f->nr_includes, eventsel) &&
	    !filter_contains_match(f->excludes, f->nr_excludes, eventsel))
		return f->action == KVM_PMU_EVENT_ALLOW;

	return f->action == KVM_PMU_EVENT_DENY;
}

> static void setup_filter_lists(struct kvm_x86_pmu_event_filter *filter)
> {
>         int i;
>
>         for (i = 0; i < filter->nevents; i++) {
>                 if(filter->events[i] & KVM_PMU_MASKED_ENTRY_EXCLUDE)

Space after the if

		if (filter-> ...)

>                         break;
>         }
>
>         filter->nr_includes = i;
>         filter->nr_excludes = filter->nevents - filter->nr_includes;
>         filter->includes = filter->events;
>         filter->excludes = filter->events + filter->nr_includes;
> }
> ...
> static void prepare_filter_events(struct kvm_x86_pmu_event_filter *filter)
> {
>         int i, j;
>
>         if (filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS)
>                 return;
>
>         for (i = 0, j = 0; i < filter->nevents; i++) {
>                 /*
>                  * Skip events that are impossible to match against a guest
>                  * event.  When filtering, only the event select + unit mask
>                  * of the guest event is used.

This is a good place for calling out that impossible filters can't be rejected
for backwards compatibility reasons.

>                  */
>                 if (filter->events[i] & ~(kvm_pmu_ops.EVENTSEL_EVENT |
>                                           ARCH_PERFMON_EVENTSEL_UMASK))
>                         continue;
>
>                 /*
>                  * Convert userspace events to a common in-kernel event so
>                  * only one code path is needed to support both events.  For
>                  * the in-kernel events use masked events because they are
>                  * flexible enough to handle both cases.  To convert to masked
>                  * events all that's needed is to add the umask_mask.

I think it's worth calling out this creates an "all ones" umask_mask, and that
EXCLUDE isn't supported.

>                  */
>                 filter->events[j++] =
>                         filter->events[i] |
>                         (0xFFULL << KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT);
>         }
>
>         filter->nevents = j;
> }

...

> -       /* Restore the verified state to guard against TOCTOU attacks. */
> -       *filter = tmp;
> +       r = -EINVAL;
> +       if (!is_filter_valid(filter))
> +               goto cleanup;
>
> -       remove_impossible_events(filter);
> +       prepare_filter_events(filter);
>
>         /*
> -        * Sort the in-kernel list so that we can search it with bsearch.
> +        * Sort entries by event select so that all entries for a given
> +        * event select can be processed efficiently during filtering.
>          */
> -       sort(&filter->events, filter->nevents, sizeof(__u64), cmp_u64, NULL);
> +       sort(&filter->events, filter->nevents, sizeof(filter->events[0]),
> +            filter_sort_cmp, NULL);
> +
> +       setup_filter_lists(filter);

The sort+setup should definitely go in a single helper.  Rather than have the
helpers deal with masked vs. legacy, what about putting that logic in a
top-level helper?  Then this code is simply:

	r = prepare_filter_lists(filter);
	if (r)
		goto cleanup;

And the helper names can be more explicit, i.e. can call out that they validate
a masked filter and convert to a masked filter.

E.g. (completely untested)

static bool is_masked_filter_valid(const struct kvm_x86_pmu_event_filter *filter)
{
	u64 mask = kvm_pmu_ops.EVENTSEL_EVENT |
		   KVM_PMU_MASKED_ENTRY_UMASK_MASK |
		   KVM_PMU_MASKED_ENTRY_UMASK_MATCH |
		   KVM_PMU_MASKED_ENTRY_EXCLUDE;
	int i;

	for (i = 0; i < filter->nevents; i++) {
		if (filter->events[i] & ~mask)
			return false;
	}

	return true;
}

static void convert_to_masked_filter(struct kvm_x86_pmu_event_filter *filter)
{
	int i, j;

	for (i = 0, j = 0; i < filter->nevents; i++) {
		/*
		 * Skip events that are impossible to match against a guest
		 * event.
		 * When filtering, only the event select + unit mask
		 * of the guest event is used.  To maintain backwards
		 * compatibility, impossible filters can't be rejected :-(
		 */
		if (filter->events[i] & ~(kvm_pmu_ops.EVENTSEL_EVENT |
					  ARCH_PERFMON_EVENTSEL_UMASK))
			continue;

		/*
		 * Convert userspace events to a common in-kernel event so
		 * only one code path is needed to support both events.  For
		 * the in-kernel events use masked events because they are
		 * flexible enough to handle both cases.  To convert to masked
		 * events all that's needed is to add an "all ones" umask_mask
		 * (unmasked filter events don't support EXCLUDE).
		 */
		filter->events[j++] = filter->events[i] |
			(0xFFULL << KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT);
	}

	filter->nevents = j;
}

static int prepare_filter_lists(struct kvm_x86_pmu_event_filter *filter)
{
	int i;

	if (!(filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS))
		convert_to_masked_filter(filter);
	else if (!is_masked_filter_valid(filter))
		return -EINVAL;

	/*
	 * Sort entries by event select and includes vs. excludes so that all
	 * entries for a given event select can be processed efficiently during
	 * filtering.  The EXCLUDE flag uses a more significant bit than the
	 * event select, and so the sorted list is also effectively split into
	 * includes and excludes sub-lists.
	 */
	sort(&filter->events, filter->nevents, sizeof(filter->events[0]),
	     filter_sort_cmp, NULL);

	/* Find the first EXCLUDE event (only supported for masked events). */
	i = filter->nevents;
	if (filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS) {
		for (i = 0; i < filter->nevents; i++) {
			if (filter->events[i] & KVM_PMU_MASKED_ENTRY_EXCLUDE)
				break;
		}
	}

	filter->nr_includes = i;
	filter->nr_excludes = filter->nevents - filter->nr_includes;
	filter->includes = filter->events;
	filter->excludes = filter->events + filter->nr_includes;

	return 0;
}
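As a purely illustrative aside, the split produced by filter_sort_cmp() can be
seen with a tiny standalone program (not kernel code; the bit positions below
are placeholders, not the series' actual KVM_PMU_MASKED_ENTRY_* layout).  The
point is only that keying the sort on EXCLUDE | EVENT_SELECT, with EXCLUDE as
the more significant bit, yields all includes followed by all excludes:

/* Standalone sketch; DEMO_* bit positions are made up for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_EVENT_SELECT	((uint64_t)0xFFF)	/* bits 11:0, assumed */
#define DEMO_EXCLUDE		((uint64_t)1 << 55)	/* assumed high bit */

static int demo_sort_cmp(const void *pa, const void *pb)
{
	uint64_t a = *(const uint64_t *)pa & (DEMO_EVENT_SELECT | DEMO_EXCLUDE);
	uint64_t b = *(const uint64_t *)pb & (DEMO_EVENT_SELECT | DEMO_EXCLUDE);

	return (a > b) - (a < b);
}

int main(void)
{
	/* Two includes and two excludes, deliberately interleaved. */
	uint64_t events[] = {
		0x0C0 | DEMO_EXCLUDE,
		0x02E,
		0x02E | DEMO_EXCLUDE,
		0x0C0,
	};
	size_t i, nevents = sizeof(events) / sizeof(events[0]);

	qsort(events, nevents, sizeof(events[0]), demo_sort_cmp);

	/*
	 * Sorted order: 0x02E, 0x0C0, 0x02E|EXCLUDE, 0x0C0|EXCLUDE, i.e. the
	 * includes come first and the excludes start at index 2.
	 */
	for (i = 0; i < nevents; i++)
		printf("events[%zu] = %#llx\n", i, (unsigned long long)events[i]);

	return 0;
}

With that ordering, nr_includes is simply the index of the first EXCLUDE
entry, which is exactly what the loop at the end of prepare_filter_lists()
computes.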
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index abd7c32126ce..16c3e6fd4ed7 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5029,6 +5029,15 @@ using this ioctl.
 :Parameters: struct kvm_pmu_event_filter (in)
 :Returns: 0 on success, -1 on error
 
+Errors:
+
+  ======     ============================================================
+  EFAULT     args[0] cannot be accessed
+  EINVAL     args[0] contains invalid data in the filter or filter events
+  E2BIG      nevents is too large
+  EBUSY      not enough memory to allocate the filter
+  ======     ============================================================
+
 ::
 
   struct kvm_pmu_event_filter {
@@ -5040,14 +5049,73 @@ using this ioctl.
 	__u64 events[0];
   };
 
-This ioctl restricts the set of PMU events that the guest can program.
-The argument holds a list of events which will be allowed or denied.
-The eventsel+umask of each event the guest attempts to program is compared
-against the events field to determine whether the guest should have access.
-The events field only controls general purpose counters; fixed purpose
-counters are controlled by the fixed_counter_bitmap.
+This ioctl restricts the set of PMU events the guest can program by limiting
+which event select and unit mask combinations are permitted.
+
+The argument holds a list of filter events which will be allowed or denied.
+
+Filter events only control general purpose counters; fixed purpose counters
+are controlled by the fixed_counter_bitmap.
+
+Valid values for 'flags'::
+
+``0``
+
+To use this mode, clear the 'flags' field.
+
+In this mode each filter event will contain an event select + unit mask.
+
+When the guest attempts to program the PMU the event select + unit mask from
+the event select register being programmed is compared against the filter
+events to determine whether the guest should have access.
+
+``KVM_PMU_EVENT_FLAG_MASKED_EVENTS``
+:Capability: KVM_CAP_PMU_EVENT_MASKED_EVENTS
+
+In this mode each filter event will contain an event select, mask, match, and
+exclude value.
+
+When the guest attempts to program the PMU, these steps are followed in
+determining if the guest should have access:
+ 1. Match the event select from the guest against the filter events.
+ 2. If a match is found, match the guest's unit mask to the mask and match
+    values of the included filter events.
+    I.e. (unit mask & mask) == match && !exclude.
+ 3. If a match is found, match the guest's unit mask to the mask and match
+    values of the excluded filter events.
+    I.e. (unit mask & mask) == match && exclude.
+ 4.
+  a. If an included match is found and an excluded match is not found, filter
+     the event.
+  b. For everything else, do not filter the event.
+ 5.
+  a. If the event is filtered and it's an allow list, allow the guest to
+     program the event.
+  b. If the event is filtered and it's a deny list, do not allow the guest to
+     program the event.
+
+To encode a filter event use:
+  KVM_PMU_EVENT_ENCODE_MASKED_ENTRY().
+
+To access individual components in a masked entry use:
+::
+
+  struct kvm_pmu_event_masked_entry {
+	union {
+		__u64 raw;
+		struct {
+			/* event_select bits 11:8 are only valid on AMD. */
+			__u64 event_select:12;
+			__u64 mask:8;
+			__u64 match:8;
+			__u64 exclude:1;
+			__u64 rsvd:35;
+		};
+	};
+  };
+
-No flags are defined yet, the field must be zero.
+-EINVAL will be returned if the 'rsvd' field is not zero or if any of
+the high bits (11:8) in the 'event_select' field are set when called
+on Intel.
 Valid values for 'action'::
 
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 46de10a809ec..c82400a06c8b 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -528,6 +528,34 @@ struct kvm_pmu_event_filter {
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+#define KVM_PMU_EVENT_FLAG_MASKED_EVENTS	BIT(0)
+#define KVM_PMU_EVENT_FLAGS_VALID_MASK		(KVM_PMU_EVENT_FLAG_MASKED_EVENTS)
+
+struct kvm_pmu_event_masked_entry {
+	union {
+		__u64 raw;
+		struct {
+			/* event_select bits 11:8 are only valid on AMD. */
+			__u64 event_select:12;
+			__u64 mask:8;
+			__u64 match:8;
+			__u64 exclude:1;
+			__u64 rsvd:35;
+		};
+	};
+};
+
+#define KVM_PMU_EVENT_ENCODE_MASKED_ENTRY(event_select, mask, match, exclude) \
+	(((event_select) & 0xFFFULL) | \
+	 (((mask) & 0xFFULL) << 12) | \
+	 (((match) & 0xFFULL) << 20) | \
+	 ((__u64)(!!(exclude)) << 28))
+
+#define KVM_PMU_EVENT_MASKED_EVENTSEL_EVENT	(GENMASK_ULL(11, 0))
+#define KVM_PMU_EVENT_MASKED_EVENTSEL_MASK	(GENMASK_ULL(19, 12))
+#define KVM_PMU_EVENT_MASKED_EVENTSEL_MATCH	(GENMASK_ULL(27, 20))
+#define KVM_PMU_EVENT_MASKED_EVENTSEL_EXCLUDE	(BIT_ULL(28))
+
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 7ce8bfafea91..b188ddb23f75 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -252,34 +252,61 @@ static inline u8 get_unit_mask(u64 eventsel)
 	return (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8;
 }
 
-static int cmp_u64(const void *pa, const void *pb)
+static int filter_event_cmp(const void *pa, const void *pb)
 {
-	u64 a = *(u64 *)pa;
-	u64 b = *(u64 *)pb;
+	u64 a = *(u64 *)pa & (KVM_PMU_FILTER_EVENTSEL_INDEX |
+			      KVM_PMU_FILTER_EVENTSEL_EVENT |
+			      KVM_PMU_FILTER_EVENTSEL_EXCLUDE);
+	u64 b = *(u64 *)pb & (KVM_PMU_FILTER_EVENTSEL_INDEX |
+			      KVM_PMU_FILTER_EVENTSEL_EVENT |
+			      KVM_PMU_FILTER_EVENTSEL_EXCLUDE);
 
 	return (a > b) - (a < b);
 }
 
-static u64 *find_filter_entry(struct kvm_pmu_event_filter *filter, u64 key)
+static int find_filter_index(struct kvm_pmu_event_filter *filter, u64 key)
 {
-	return bsearch(&key, filter->events, filter->nevents,
-		       sizeof(filter->events[0]), cmp_u64);
+	u64 *fe = bsearch(&key, filter->events, filter->nevents,
+			  sizeof(filter->events[0]), filter_event_cmp);
+
+	if (!fe)
+		return -1;
+
+	return fe - filter->events;
 }
 
 static bool filter_contains_match(struct kvm_pmu_event_filter *filter,
-				  u64 eventsel)
+				  u64 eventsel, bool exclude)
 {
 	u16 event_select = get_event_select(eventsel);
 	u8 unit_mask = get_unit_mask(eventsel);
+	struct kvm_pmu_filter_entry fe;
+	int i, index;
 	u64 key;
 
-	key = KVM_PMU_ENCODE_FILTER_ENTRY(event_select, unit_mask);
-	return find_filter_entry(filter, key);
+	key = KVM_PMU_ENCODE_FILTER_ENTRY(event_select, 0, 0, exclude);
+	index = find_filter_index(filter, key);
+	if (index < 0)
+		return false;
+
+	for (i = index; i < filter->nevents; i++) {
+		fe.raw = filter->events[i];
+
+		if (fe.event_select != event_select ||
+		    fe.exclude != exclude)
+			break;
+
+		if ((unit_mask & fe.mask) == fe.match)
+			return true;
+	}
+
+	return false;
 }
 
 static bool is_gp_event_allowed(struct kvm_pmu_event_filter *filter,
 				u64 eventsel)
 {
-	if (filter_contains_match(filter, eventsel))
+	if (filter_contains_match(filter, eventsel, false) &&
+	    !filter_contains_match(filter, eventsel, true))
 		return filter->action == KVM_PMU_EVENT_ALLOW;
 
 	return filter->action == KVM_PMU_EVENT_DENY;
@@ -598,11 +625,20 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id)
 }
 EXPORT_SYMBOL_GPL(kvm_pmu_trigger_event);
 
-static inline u64 get_event_filter_mask(void)
+static inline u64 get_event_filter_mask(u32 flags)
 {
 	u64 event_select_mask =
 		static_call(kvm_x86_pmu_get_eventsel_event_mask)();
 
+	if (flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS) {
+		u64 masked_eventsel_event = get_event_select(event_select_mask);
+
+		return masked_eventsel_event |
+		       KVM_PMU_EVENT_MASKED_EVENTSEL_MASK |
+		       KVM_PMU_EVENT_MASKED_EVENTSEL_MATCH |
+		       KVM_PMU_EVENT_MASKED_EVENTSEL_EXCLUDE;
+	}
+
 	return event_select_mask | ARCH_PERFMON_EVENTSEL_UMASK;
 }
 
@@ -611,6 +647,29 @@ static inline bool is_event_valid(u64 event, u64 mask)
 	return !(event & ~mask);
 }
 
+static bool is_filter_valid(struct kvm_pmu_event_filter *filter)
+{
+	u64 event_mask;
+	int i;
+
+	/* To maintain backwards compatibility don't validate raw events. */
+	if (!filter->flags)
+		return true;
+
+	event_mask = get_event_filter_mask(filter->flags);
+	for (i = 0; i < filter->nevents; i++) {
+		if (!is_event_valid(filter->events[i], event_mask))
+			return false;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(filter->pad); i++) {
+		if (filter->pad[i])
+			return false;
+	}
+
+	return true;
+}
+
 static void remove_invalid_raw_events(struct kvm_pmu_event_filter *filter)
 {
 	u64 raw_mask;
@@ -619,7 +678,7 @@ static void remove_invalid_raw_events(struct kvm_pmu_event_filter *filter)
 	if (filter->flags)
 		return;
 
-	raw_mask = get_event_filter_mask();
+	raw_mask = get_event_filter_mask(filter->flags);
 	for (i = 0, j = 0; i < filter->nevents; i++) {
 		u64 raw_event = filter->events[i];
 
@@ -630,12 +689,27 @@ static void remove_invalid_raw_events(struct kvm_pmu_event_filter *filter)
 	filter->nevents = j;
 }
 
-static inline u64 encode_filter_entry(u64 event)
+static inline u64 encode_filter_entry(u64 event, u32 flags)
 {
-	u16 event_select = get_event_select(event);
-	u8 unit_mask = get_unit_mask(event);
+	u16 event_select;
+	u8 mask, match;
+	bool exclude;
 
-	return KVM_PMU_ENCODE_FILTER_ENTRY(event_select, unit_mask);
+	if (flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS) {
+		struct kvm_pmu_event_masked_entry me = { .raw = event };
+
+		mask = me.mask;
+		match = me.match;
+		event_select = me.event_select;
+		exclude = me.exclude;
+	} else {
+		mask = 0xFF;
+		match = get_unit_mask(event);
+		event_select = get_event_select(event);
+		exclude = false;
+	}
+
+	return KVM_PMU_ENCODE_FILTER_ENTRY(event_select, mask, match, exclude);
 }
 
 static void convert_to_filter_events(struct kvm_pmu_event_filter *filter)
@@ -645,7 +719,34 @@ static void convert_to_filter_events(struct kvm_pmu_event_filter *filter)
 	for (i = 0; i < filter->nevents; i++) {
 		u64 e = filter->events[i];
 
-		filter->events[i] = encode_filter_entry(e);
+		filter->events[i] = encode_filter_entry(e, filter->flags);
+	}
+}
+
+/*
+ * Sort will order the list by exclude, then event select.  This function will
+ * then index the sublists of event selects such that when a search is done on
+ * the list, the head of the event select sublist is returned.  This simplifies
+ * the logic in filter_contains_match() when walking the list.
+ */
+static void index_filter_events(struct kvm_pmu_event_filter *filter)
+{
+	struct kvm_pmu_filter_entry *prev, *curr;
+	int i, index = 0;
+
+	if (filter->nevents)
+		prev = (struct kvm_pmu_filter_entry *)(filter->events);
+
+	for (i = 0; i < filter->nevents; i++) {
+		curr = (struct kvm_pmu_filter_entry *)(&filter->events[i]);
+
+		if (curr->event_select != prev->event_select ||
+		    curr->exclude != prev->exclude) {
+			index = 0;
+			prev = curr;
+		}
+
+		curr->event_index = index++;
 	}
 }
 
@@ -659,7 +760,9 @@ static void prepare_filter_events(struct kvm_pmu_event_filter *filter)
 	 * Sort the in-kernel list so that we can search it with bsearch.
 	 */
 	sort(&filter->events, filter->nevents, sizeof(filter->events[0]),
-	     cmp_u64, NULL);
+	     filter_event_cmp, NULL);
+
+	index_filter_events(filter);
 }
 
 int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
@@ -675,7 +778,7 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
 	    tmp.action != KVM_PMU_EVENT_DENY)
 		return -EINVAL;
 
-	if (tmp.flags != 0)
+	if (tmp.flags & ~KVM_PMU_EVENT_FLAGS_VALID_MASK)
 		return -EINVAL;
 
 	if (tmp.nevents > KVM_PMU_EVENT_FILTER_MAX_EVENTS)
@@ -693,6 +796,10 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
 	/* Ensure nevents can't be changed between the user copies. */
 	*filter = tmp;
 
+	r = -EINVAL;
+	if (!is_filter_valid(filter))
+		goto cleanup;
+
 	prepare_filter_events(filter);
 
 	mutex_lock(&kvm->lock);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index df4f81e5c685..ffc07e4d8d71 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -206,19 +206,41 @@ bool is_vmware_backdoor_pmc(u32 pmc_idx);
 extern struct kvm_pmu_ops intel_pmu_ops;
 extern struct kvm_pmu_ops amd_pmu_ops;
 
+#define KVM_PMU_FILTER_EVENTSEL_INDEX	(GENMASK_ULL(27, 16))
+#define KVM_PMU_FILTER_EVENTSEL_EVENT	(GENMASK_ULL(39, 28))
+#define KVM_PMU_FILTER_EVENTSEL_EXCLUDE	(BIT_ULL(40))
+
+/*
+ * Internal representation by which all filter events converge (e.g. masked
+ * events, raw events).  That way there is only one way filter events
+ * behave once the filter is set.
+ *
+ * When filter events are converted into this format then sorted, the
+ * resulting list naturally ends up in two sublists.  One for the 'include
+ * list' and one for the 'exclude list'.  These sublists are further broken
+ * down into sublists ordered by their event select.  After that, the
+ * event select sublists are indexed such that a search for: exclude = n,
+ * event_select = n, and event_index = 0 will return the head of an event
+ * select sublist that can be walked to see if a match exists.
+ */
 struct kvm_pmu_filter_entry {
 	union {
 		u64 raw;
 		struct {
+			u64 mask:8;
+			u64 match:8;
+			u64 event_index:12;
 			u64 event_select:12;
-			u64 unit_mask:8;
-			u64 rsvd:44;
+			u64 exclude:1;
+			u64 rsvd:23;
 		};
 	};
 };
 
-#define KVM_PMU_ENCODE_FILTER_ENTRY(event_select, unit_mask) \
-	(((event_select) & 0xFFFULL) | \
-	 (((unit_mask) & 0xFFULL) << 12))
+#define KVM_PMU_ENCODE_FILTER_ENTRY(event_select, mask, match, exclude) \
+	(((mask) & 0xFFULL) | \
+	 (((match) & 0xFFULL) << 8) | \
+	 (((event_select) & 0xFFFULL) << 28) | \
+	 ((u64)(!!(exclude)) << 40))
 
 #endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 43a6a7efc6ec..0a6c5e1aca0f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4374,6 +4374,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SPLIT_IRQCHIP:
 	case KVM_CAP_IMMEDIATE_EXIT:
 	case KVM_CAP_PMU_EVENT_FILTER:
+	case KVM_CAP_PMU_EVENT_MASKED_EVENTS:
 	case KVM_CAP_GET_MSR_FEATURES:
 	case KVM_CAP_MSR_PLATFORM_INFO:
 	case KVM_CAP_EXCEPTION_PAYLOAD:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index eed0315a77a6..685034baea9d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1177,6 +1177,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_VM_DISABLE_NX_HUGE_PAGES 220
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
+#define KVM_CAP_PMU_EVENT_MASKED_EVENTS 223
 
 #ifdef KVM_CAP_IRQ_ROUTING
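For completeness, here is a rough sketch of how userspace might program a
masked filter with the uAPI above.  It is untested and illustrative only:
it assumes headers with this series applied, `set_masked_filter` and `vm_fd`
are placeholder names, the event code 0xC0 is an arbitrary example, and error
handling is elided.  The intent is an allow list that permits every unit mask
of event 0xC0 except unit mask 0x02:

#include <linux/kvm.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

static int set_masked_filter(int vm_fd)
{
	struct kvm_pmu_event_filter *filter;
	size_t size;
	int r;

	const __u64 events[] = {
		/* Include all unit masks of event 0xC0 (mask 0 matches everything). */
		KVM_PMU_EVENT_ENCODE_MASKED_ENTRY(0xC0, 0x00, 0x00, 0),
		/* ...but exclude unit mask 0x02 of event 0xC0. */
		KVM_PMU_EVENT_ENCODE_MASKED_ENTRY(0xC0, 0xFF, 0x02, 1),
	};

	size = sizeof(*filter) + sizeof(events);
	filter = calloc(1, size);
	if (!filter)
		return -1;

	filter->action = KVM_PMU_EVENT_ALLOW;
	filter->flags = KVM_PMU_EVENT_FLAG_MASKED_EVENTS;
	filter->nevents = sizeof(events) / sizeof(events[0]);
	/* fixed_counter_bitmap (left 0 here) governs fixed counters separately. */
	memcpy(filter->events, events, sizeof(events));

	r = ioctl(vm_fd, KVM_SET_PMU_EVENT_FILTER, filter);
	free(filter);
	return r;
}

Per the matching steps in the documentation above, a guest eventsel for 0xC0
matches the include entry, and only unit mask 0x02 also matches the exclude
entry, so it alone falls through to the deny side of the allow list.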