Message ID: 20240223192658.45893-2-rrameshbabu@nvidia.com
State: RFC
Delegated to: Netdev Maintainers
Series: ethtool HW timestamping statistics
On 2/23/2024 11:24 AM, Rahul Rameshbabu wrote:
> +/**
> + * struct ethtool_ts_stats - HW timestamping statistics
> + * @layer: input field denoting whether stats should be queried from the DMA or
> + *	PHY timestamping layer. Defaults to the active layer for packet
> + *	timestamping.
> + * @tx_stats: struct group for TX HW timestamping
> + * @pkts: Number of packets successfully timestamped by the queried
> + *	layer.
> + * @lost: Number of packet timestamps that failed to get applied on a
> + *	packet by the queried layer.
> + * @late: Number of packet timestamps that were delivered by the
> + *	hardware but were lost due to arriving too late.
> + * @err: Number of timestamping errors that occurred on the queried
> + *	layer.
> + */
> +struct ethtool_ts_stats {
> +	enum ethtool_ts_stats_layer layer;
> +	struct_group(tx_stats,
> +		u64 pkts;
> +		u64 lost;
> +		u64 late;
> +		u64 err;
> +	);
> +};

The Intel ice driver has the following Tx timestamp statistics:

tx_hwtstamp_skipped - indicates when we get a Tx timestamp request but
are unable to fulfill it.
tx_hwtstamp_timeouts - indicates we had a Tx timestamp skb waiting for a
timestamp from hardware but it didn't get received within some internal
time limit.
tx_hwtstamp_flushed - indicates that we flushed an outstanding timestamp
before it completed, such as if the link resets or similar.
tx_hwtstamp_discarded - indicates that we obtained a timestamp from
hardware but were unable to complete it due to invalid cached data used
for timestamp extension.

I think these could be translated roughly to one of the lost, late, or
err stats. I am a bit confused as to how drivers could distinguish
between lost and late, but I guess that depends on the specific hardware
design.

In theory we could keep some of these more detailed stats, but I don't
think we strictly need to be as detailed as the ice driver is. The only
major addition I think is the skipped stat, which I would prefer to
have. Perhaps that could be tracked in the netdev layer by checking the
skb flags to see whether or not the driver actually set the appropriate
flag?

I think I can otherwise translate the flushed status to the lost
category, the timeout to the late category, and everything else to the
error category. I can easily add a counter to track completed timestamps
as well.

TL;DR: I would like to see a "skipped" category since I think that
should be distinguished from general errors.

Thanks!
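[Editor's note: the category mapping proposed in the message above can be sketched as a small userspace model. All names here (the struct, the event enum, the helper) are hypothetical stand-ins for illustration, not actual kernel or ice driver symbols; only the mapping itself — flushed to lost, timeout to late, the rest to err — follows the thread.]

```c
#include <assert.h>
#include <stdint.h>

/* Generic categories mirroring the proposed tx_stats group. */
struct ts_tx_stats {
	uint64_t pkts;	/* successfully timestamped */
	uint64_t lost;	/* timestamp never applied to a packet */
	uint64_t late;	/* delivered, but only after the deadline */
	uint64_t err;	/* everything else */
};

/* Hypothetical driver-side events, loosely modeled on the ice counters. */
enum tx_ts_event {
	TS_EVENT_COMPLETED,	/* timestamp delivered in time */
	TS_EVENT_TIMEOUT,	/* cf. tx_hwtstamp_timeouts */
	TS_EVENT_FLUSHED,	/* cf. tx_hwtstamp_flushed */
	TS_EVENT_DISCARDED,	/* cf. tx_hwtstamp_discarded */
};

/* One possible folding of detailed driver events into the generic
 * counters: flushed -> lost, timeout -> late, everything else -> err.
 */
static inline void ts_stats_account(struct ts_tx_stats *s, enum tx_ts_event ev)
{
	switch (ev) {
	case TS_EVENT_COMPLETED:
		s->pkts++;
		break;
	case TS_EVENT_FLUSHED:
		s->lost++;
		break;
	case TS_EVENT_TIMEOUT:
		s->late++;
		break;
	default:
		s->err++;
		break;
	}
}
```

A driver keeping more detailed vendor counters could bump those alongside this generic folding.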
On Fri, 23 Feb, 2024 13:07:08 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 2/23/2024 11:24 AM, Rahul Rameshbabu wrote:
>> +/**
>> + * struct ethtool_ts_stats - HW timestamping statistics
>> + * @layer: input field denoting whether stats should be queried from the DMA or
>> + *	PHY timestamping layer. Defaults to the active layer for packet
>> + *	timestamping.
>> + * @tx_stats: struct group for TX HW timestamping
>> + * @pkts: Number of packets successfully timestamped by the queried
>> + *	layer.
>> + * @lost: Number of packet timestamps that failed to get applied on a
>> + *	packet by the queried layer.
>> + * @late: Number of packet timestamps that were delivered by the
>> + *	hardware but were lost due to arriving too late.
>> + * @err: Number of timestamping errors that occurred on the queried
>> + *	layer.
>> + */
>> +struct ethtool_ts_stats {
>> +	enum ethtool_ts_stats_layer layer;
>> +	struct_group(tx_stats,
>> +		u64 pkts;
>> +		u64 lost;
>> +		u64 late;
>> +		u64 err;
>> +	);
>> +};
>
> The Intel ice drivers has the following Tx timestamp statistics:
>
> tx_hwtstamp_skipped - indicates when we get a Tx timestamp request but
> are unable to fulfill it.
> tx_hwtstamp_timeouts - indicates we had a Tx timestamp skb waiting for a
> timestamp from hardware but it didn't get received within some internal
> time limit.

This is interesting. In mlx5 land, the only case where we are unable to
fulfill a hwtstamp is when the timestamp information is lost or late.

"lost" for us means that the timestamp never arrived within some internal
time limit; past that point, our device supposedly can never deliver the
timestamp information. "late" for us means that we got hardware timestamp
information delivered after that internal time limit. We are able to
track this by using identifiers in our completion events, and we only
release references to these identifiers upon delivery (never delivering
leaks the references; enough build-up leads to a recovery flow).

The theory for us is that late timestamp information arriving after that
period of time should not happen. However, the truth is that it does
happen, and we want our driver implementation to be resilient to this
case rather than trusting the time interval.

Do you have any example of a case of skipping timestamp information that
is not related to lack of delivery over time? I am wondering whether this
case is more like a hardware error, or more like being busy, i.e.
recording the timestamp information would impact line rate.

> tx_hwtstamp_flushed - indicates that we flushed an outstanding timestamp
> before it completed, such as if the link resets or similar.
> tx_hwtstamp_discarded - indicates that we obtained a timestamp from
> hardware but were unable to complete it due to invalid cached data used
> for timestamp extension.
>
> I think these could be translated roughly to one of the lost, late, or
> err stats. I am a bit confused as to how drivers could distinguish
> between lost and late, but I guess that depends on the specific hardware
> design.
>
> In theory we could keep some of these more detailed stats but I don't
> think we strictly need to be as detailed as the ice driver is.

We also converged a statistic in the mlx5 driver to the simple error
counter here. I think what makes sense is that design-specific counters
should be exposed as driver-specific counters, and more common counters
should be converged into the ethtool_ts_stats struct.

> The only major addition I think is the skipped stat, which I would
> prefer to have. Perhaps that could be tracked in the netdev layer by
> checking whether the skb flags to see whether or not the driver actually
> set the appropriate flag?

I guess the problem is: how would the core stack know at what layer this
was skipped? (I think Kory's patch series can be used to help with this
since it's adding a common interface in ethtool to select the
timestamping layer.) As of today, mlx5 is the only driver I know of that
supports selecting between the DMA and PHY layers for timestamp
information.

> I think i can otherwise translate the flushed status to the lost
> category, the timeout to the late category, and everything else to the
> error category. I can easily add a counter to track completed timestamps
> as well.

Sounds good.

--
Thanks,
Rahul Rameshbabu
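[Editor's note: the lost/late distinction described above can be sketched as a userspace model. The one-second deadline mirrors the figure discussed later in the thread; the names and helper are hypothetical, not actual mlx5 code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
/* Assumed deadline after which an undelivered timestamp is written off. */
#define TS_UNDELIVERED_TIMEOUT_NS (1ULL * NSEC_PER_SEC)

enum ts_outcome {
	TS_PENDING,	/* still waiting, within the deadline */
	TS_OK,		/* delivered in time */
	TS_LOST,	/* deadline passed with no delivery */
	TS_LATE,	/* delivered, but only after the deadline */
};

/* Classify one tracked timestamp identifier, given whether a completion
 * carrying the timestamp has been seen and how many nanoseconds have
 * elapsed since the Tx timestamp request was made.
 */
static enum ts_outcome ts_classify(bool delivered, uint64_t elapsed_ns)
{
	if (delivered)
		return elapsed_ns > TS_UNDELIVERED_TIMEOUT_NS ? TS_LATE : TS_OK;
	return elapsed_ns > TS_UNDELIVERED_TIMEOUT_NS ? TS_LOST : TS_PENDING;
}
```

The point of keeping TS_LATE separate from TS_LOST is exactly the resilience argument above: a completion arriving after the deadline is accounted for rather than trusted never to happen.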
On 2/23/2024 2:21 PM, Rahul Rameshbabu wrote:
>> The Intel ice drivers has the following Tx timestamp statistics:
>>
>> tx_hwtstamp_skipped - indicates when we get a Tx timestamp request but
>> are unable to fulfill it.
>> tx_hwtstamp_timeouts - indicates we had a Tx timestamp skb waiting for a
>> timestamp from hardware but it didn't get received within some internal
>> time limit.
>
> This is interesting. In mlx5 land, the only case where we are unable to
> fulfill a hwtstamp is when the timestamp information is lost or late.
>

For ice, the timestamps are captured in the PHY and stored in a block of
registers with limited slots. The driver tracks the available slots and
uses one when a Tx timestamp request comes in.

So we have "skipped" because it's possible to request too many timestamps
at once and fill up all the slots before the first timestamp is reported
back.

> lost for us means that the timestamp never arrived within some internal
> time limit that our device will supposedly never be able to deliver
> timestamp information after that point.
>

That is more or less the equivalent of what we have for timeout.

> late for us is that we got hardware timestamp information delivered
> after that internal time limit. We are able to track this by using
> identifiers in our completion events and we only release references to
> these identifiers upon delivery (never delivering leaks the references.
> Enough build up leads to a recovery flow). The theory for us is that
> late timestamp information arrival after that period of time should not
> happen. However the truth is that it does happen and we want our driver
> implementation to be resilient to this case rather than trusting the
> time interval.
>

In our case, because of how the slots work, once we "timeout" a slot, it
could get re-used. We set the timeout to be pretty ridiculous (1 second)
to ensure that if we do time out, it's almost certainly because hardware
never timestamped the packet.

> Do you have any example of a case of skipping timestamp information that
> is not related to lack of delivery over time? I am wondering if this
> case is more like a hardware error or not. Or is it more like something
> along the lines of being busy/would impact line rate of timestamp
> information must be recorded?
>

The main example for skipped is the event where all our slots are full
at the point of a timestamp request.

There have been a few rare cases where things like a link event or issues
with the MAC dropping a packet mean the PHY simply never gets the packet
and thus never timestamps it. This is typically the result of a lost
timestamp.

Flushed, for us, is when we reset the timestamp block while it has
timestamps outstanding. This can happen, for example, due to link
changes, where we ultimately

>> tx_hwtstamp_flushed - indicates that we flushed an outstanding timestamp
>> before it completed, such as if the link resets or similar.
>> tx_hwtstamp_discarded - indicates that we obtained a timestamp from
>> hardware but were unable to complete it due to invalid cached data used
>> for timestamp extension.
>>
>> I think these could be translated roughly to one of the lost, late, or
>> err stats. I am a bit confused as to how drivers could distinguish
>> between lost and late, but I guess that depends on the specific hardware
>> design.
>>
>> In theory we could keep some of these more detailed stats but I don't
>> think we strictly need to be as detailed as the ice driver is.
>
> We also converged a statistic in the mlx5 driver to the simple error
> counter here. I think what makes sense is design specific counters
> should be exposed as driver specific counters and more common counters
> should be converged into the ethtool_ts_stats struct.
>

Sure, that seems reasonable.

>>
>> The only major addition I think is the skipped stat, which I would
>> prefer to have. Perhaps that could be tracked in the netdev layer by
>> checking whether the skb flags to see whether or not the driver actually
>> set the appropriate flag?
>
> I guess the problem is how would the core stack know at what layer this
> was skipped at (I think Kory's patch series can be used to help with
> this since it's adding a common interface in ethtool to select the
> timestamping layer). As of today, mlx5 is the only driver I know of that
> supports selecting between the DMA and PHY layers for timestamp
> information.
>

Well, the way the interface worked in my understanding was that the core
sets the SKBTX_HW_TSTAMP flag. The driver is supposed to then prepare the
packet for timestamping and set the SKBTX_IN_PROGRESS flag. I just
looked, though, and it looks like ice doesn't actually set this flag...

If we fixed this, in theory the stack should be able to check, after a
packet with SKBTX_HW_TSTAMP gets sent, whether SKBTX_IN_PROGRESS is set;
if it isn't, that would be a skipped timestamp.

It's not really a huge deal, and this could just be lumped into either
lost or err.
On Fri, 23 Feb, 2024 14:48:51 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 2/23/2024 2:21 PM, Rahul Rameshbabu wrote:
>>> The Intel ice drivers has the following Tx timestamp statistics:
>>>
>>> tx_hwtstamp_skipped - indicates when we get a Tx timestamp request but
>>> are unable to fulfill it.
>>> tx_hwtstamp_timeouts - indicates we had a Tx timestamp skb waiting for a
>>> timestamp from hardware but it didn't get received within some internal
>>> time limit.
>>
>> This is interesting. In mlx5 land, the only case where we are unable to
>> fulfill a hwtstamp is when the timestamp information is lost or late.
>>
>
> For ice, the timestamps are captured in the PHY and stored in a block of
> registers with limited slots. The driver tracks the available slots and
> uses one when a Tx timestamp request comes in.
>
> So we have "skipped" because its possible to request too many timestamps
> at once and fill up all the slots before the first timestamp is reported
> back.
>
>> lost for us means that the timestamp never arrived within some internal
>> time limit that our device will supposedly never be able to deliver
>> timestamp information after that point.
>>
>
> That is more or less the equivalent we have for timeout.
>
>> late for us is that we got hardware timestamp information delivered
>> after that internal time limit. We are able to track this by using
>> identifiers in our completion events and we only release references to
>> these identifiers upon delivery (never delivering leaks the references.
>> Enough build up leads to a recovery flow). The theory for us is that
>> late timestamp information arrival after that period of time should not
>> happen. However the truth is that it does happen and we want our driver
>> implementation to be resilient to this case rather than trusting the
>> time interval.
>>
>
> In our case, because of how the slots work, once we "timeout" a slot, it
> could get re-used. We set the timeout to be pretty ridiculous (1 second)
> to ensure that if we do timeout its almost certainly because hardware
> never timestamped the packet.

We were thinking about that design as well. We use a 1 second timeout to
be safe.

#define MLX5E_PTP_TS_CQE_UNDELIVERED_TIMEOUT (1 * NSEC_PER_SEC)

Our device does not do any bookkeeping internally to prevent a completion
event with timestamp information from arriving after 1 second. Some
internal folks have said it shouldn't be possible, but we did not want to
take any chances, so we built a model that is resilient to timestamp
deliveries after any period of time, even after consuming the skb without
appending timestamp information. If no other vendor finds this useful, we
could roll this up into the error counter and leave the late counter as
vendor specific. I do not want to introduce too many counters that are
hard to understand. We already document the device-specific counters on
top of introducing them in the code base.

https://docs.kernel.org/networking/device_drivers/ethernet/mellanox/mlx5/counters.html

>
>> Do you have any example of a case of skipping timestamp information that
>> is not related to lack of delivery over time? I am wondering if this
>> case is more like a hardware error or not. Or is it more like something
>> along the lines of being busy/would impact line rate of timestamp
>> information must be recorded?
>>
>
> The main example for skipped is the event where all our slots are full
> at point of timestamp request.

This is what I was guessing as the main (if not only) reason. For this
specific reason, I think a general "busy" stats counter makes sense. mlx5
does not need this counter, but I can see a lot of other hw
implementations needing it. (The skipped counter name obviously should be
left only in the ice driver. I just felt "busy" was easy to understand
for generalized counters.)

The reason I prefer busy is that "skip" makes me think someone used
SIOCSHWTSTAMP to filter which packets get timestamped, which is very
different from something like lack of resource availability.

https://docs.kernel.org/networking/timestamping.html#hardware-timestamping-configuration-siocshwtstamp-and-siocghwtstamp

>
> There have been a few rare cases where things like a link event or
> issues with the MAC dropping a packet where the PHY simply never gets
> the packet and thus never timestamps it. This is typically the result of
> a lost timestamp.
>
> Flushed, for us, is when we reset the timestamp block while it has
> timestamps oustanding. This can happen for example due to link changes,
> where we ultimately
>
>>> tx_hwtstamp_flushed - indicates that we flushed an outstanding timestamp
>>> before it completed, such as if the link resets or similar.
>>> tx_hwtstamp_discarded - indicates that we obtained a timestamp from
>>> hardware but were unable to complete it due to invalid cached data used
>>> for timestamp extension.
>>>
>>> I think these could be translated roughly to one of the lost, late, or
>>> err stats. I am a bit confused as to how drivers could distinguish
>>> between lost and late, but I guess that depends on the specific hardware
>>> design.
>>>
>>> In theory we could keep some of these more detailed stats but I don't
>>> think we strictly need to be as detailed as the ice driver is.
>>
>> We also converged a statistic in the mlx5 driver to the simple error
>> counter here. I think what makes sense is design specific counters
>> should be exposed as driver specific counters and more common counters
>> should be converged into the ethtool_ts_stats struct.
>>
>
> Sure that seems reasonable.
>
>>>
>>> The only major addition I think is the skipped stat, which I would
>>> prefer to have. Perhaps that could be tracked in the netdev layer by
>>> checking whether the skb flags to see whether or not the driver actually
>>> set the appropriate flag?
>>
>> I guess the problem is how would the core stack know at what layer this
>> was skipped at (I think Kory's patch series can be used to help with
>> this since it's adding a common interface in ethtool to select the
>> timestamping layer). As of today, mlx5 is the only driver I know of that
>> supports selecting between the DMA and PHY layers for timestamp
>> information.
>>
>
> Well, the way the interface worked in my understanding was that the core
> sets the SKBTX_HW_TSTAMP flag. The driver is supposed to then prepare
> the packet for timestamp and set the SKBTX_IN_PROGRESS flag. I just
> looked though, and it looks like ice doesn't actually set this flag...

That would be a good fix. We set this in mlx5.

	/* device driver is going to provide hardware time stamp */
	SKBTX_IN_PROGRESS = 1 << 2,

>
> If we fixed this, in theory the stack should be able to check after the
> packet gets sent with SKBTX_HW_TSTAMP, if SKBTX_IN_PROGRESS isn't set
> then it would be a skipped timestamp?

One question I have about this idea: wouldn't SKBTX_IN_PROGRESS also not
be set in the case when timestamp information is lost or a timeout
occurs? I feel like the problem is not being able to separate these two
cases from the perspective of the core stack.

Btw, mlx5 keeps the flag even when we fail to write timestamp
information. I feel like it might be a good idea to add a warning in the
core stack if both SKBTX_HW_TSTAMP and SKBTX_IN_PROGRESS are set but the
skb is consumed without skb_hwtstamps(skb) being written by the driver.

>
> Its not really a huge deal, and this could just be lumped into either
> lost or err.

--
Thanks,
Rahul Rameshbabu
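[Editor's note: the warning suggested above could look roughly like the following userspace model. The flag values match the kernel's skbuff.h, but the struct and helper are hypothetical stand-ins for the skb and the consume path, not actual kernel code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as in the kernel's tx_flags enum in skbuff.h. */
#define SKBTX_HW_TSTAMP		(1 << 0)	/* hardware timestamp requested */
#define SKBTX_IN_PROGRESS	(1 << 2)	/* driver claims it will provide one */

/* Minimal stand-in for the skb state relevant to the check. */
struct fake_skb {
	uint8_t tx_flags;
	bool hwtstamp_written;	/* skb_hwtstamps(skb) filled in by the driver */
};

/* Proposed core-stack sanity check at consume time: warn when the driver
 * promised a hardware timestamp (both flags set) but never delivered one.
 */
static bool ts_warn_on_consume(const struct fake_skb *skb)
{
	return (skb->tx_flags & SKBTX_HW_TSTAMP) &&
	       (skb->tx_flags & SKBTX_IN_PROGRESS) &&
	       !skb->hwtstamp_written;
}
```

Note that a skipped request (IN_PROGRESS never set) deliberately does not trigger this warning, which is the separation-of-cases concern raised above.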
On Fri, 23 Feb 2024 11:24:45 -0800 Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
> Multiple network devices that support hardware timestamping appear to have
> common behavior with regards to timestamp handling. Implement common Tx
> hardware timestamping statistics in a tx_stats struct_group. Common Rx
> hardware timestamping statistics can subsequently be implemented in a
> rx_stats struct_group for ethtool_ts_stats.
>
> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
>
> +/**
> + * enum ethtool_ts_stats_layer - layer to query hardware timestamping statistics
> + * @ETHTOOL_TS_STATS_LAYER_ACTIVE:
> + *	retrieve the statistics from the layer that is currently feeding
> + *	hardware timestamps for packets.
> + * @ETHTOOL_TS_STATS_LAYER_DMA:
> + *	retrieve the statistics from the DMA hardware timestamping layer of the
> + *	device.
> + * @ETHTOOL_TS_STATS_PHY:
> + *	retrieve the statistics from the PHY hardware timestamping layer of the
> + *	device.
> + */
> +enum ethtool_ts_stats_layer {
> +	ETHTOOL_TS_STATS_LAYER_ACTIVE,
> +	ETHTOOL_TS_STATS_LAYER_DMA,
> +	ETHTOOL_TS_STATS_LAYER_PHY,
> +};

The whole point of my v8 series' new implementation (asked for by the
maintainers) was to move on from the timestamp layer to the PHC provider,
which is described by a PHC index + PHC qualifier (precise IEEE 1588 /
approximate DMA). The struct is introduced in patch 9 of my series.

You should do the same: use the PHC provider instead of the layer. Using
only the layer, in the case of several PHYs we could not reach the right
ts stats. The same goes for a MAC having both types of timestamps, IEEE
1588 and DMA.

Regards,
On Fri, 23 Feb 2024 11:24:45 -0800 Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
> Multiple network devices that support hardware timestamping appear to have
> common behavior with regards to timestamp handling. Implement common Tx
> hardware timestamping statistics in a tx_stats struct_group. Common Rx
> hardware timestamping statistics can subsequently be implemented in a
> rx_stats struct_group for ethtool_ts_stats.
>
> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
>
> +
> +const struct nla_policy ethnl_stats_get_policy[__ETHTOOL_A_STATS_CNT] = {
> +	[ETHTOOL_A_STATS_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
> +	[ETHTOOL_A_STATS_GROUPS] = { .type = NLA_NESTED },
> +	[ETHTOOL_A_STATS_SRC] = NLA_POLICY_MAX(NLA_U32, ETHTOOL_MAC_STATS_SRC_PMAC),
> +	[ETHTOOL_A_STATS_LAYER] =
> +		NLA_POLICY_MAX(NLA_U32, ETHTOOL_TS_STATS_LAYER_PHY),
> };

You should add these new netlink attributes to the specs, in a new patch,
to be able to test it:

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index cfe48f8d6283..118508de2c88 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -859,6 +859,9 @@ attribute-sets:
       -
         name: src
         type: u32
+      -
+        name: layer
+        type: u32
     -
       name: phc-vclocks
       attributes:
@@ -1526,6 +1529,7 @@ operations:
         attributes:
           - header
           - groups
+          - layer
       reply:
         attributes:
           - header

Regards,
On 2/23/2024 3:43 PM, Rahul Rameshbabu wrote:
>
> On Fri, 23 Feb, 2024 14:48:51 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote:
>> On 2/23/2024 2:21 PM, Rahul Rameshbabu wrote:
>>>> The Intel ice drivers has the following Tx timestamp statistics:
>>>>
>>>> tx_hwtstamp_skipped - indicates when we get a Tx timestamp request but
>>>> are unable to fulfill it.
>>>> tx_hwtstamp_timeouts - indicates we had a Tx timestamp skb waiting for a
>>>> timestamp from hardware but it didn't get received within some internal
>>>> time limit.
>>>
>>> This is interesting. In mlx5 land, the only case where we are unable to
>>> fulfill a hwtstamp is when the timestamp information is lost or late.
>>>
>>
>> For ice, the timestamps are captured in the PHY and stored in a block of
>> registers with limited slots. The driver tracks the available slots and
>> uses one when a Tx timestamp request comes in.
>>
>> So we have "skipped" because its possible to request too many timestamps
>> at once and fill up all the slots before the first timestamp is reported
>> back.
>>
>>> lost for us means that the timestamp never arrived within some internal
>>> time limit that our device will supposedly never be able to deliver
>>> timestamp information after that point.
>>>
>>
>> That is more or less the equivalent we have for timeout.
>>
>>> late for us is that we got hardware timestamp information delivered
>>> after that internal time limit. We are able to track this by using
>>> identifiers in our completion events and we only release references to
>>> these identifiers upon delivery (never delivering leaks the references.
>>> Enough build up leads to a recovery flow). The theory for us is that
>>> late timestamp information arrival after that period of time should not
>>> happen. However the truth is that it does happen and we want our driver
>>> implementation to be resilient to this case rather than trusting the
>>> time interval.
>>>
>>
>> In our case, because of how the slots work, once we "timeout" a slot, it
>> could get re-used. We set the timeout to be pretty ridiculous (1 second)
>> to ensure that if we do timeout its almost certainly because hardware
>> never timestamped the packet.
>
> We were thinking about that design as well. We use a 1 second timeout to
> be safe.
>
> #define MLX5E_PTP_TS_CQE_UNDELIVERED_TIMEOUT (1 * NSEC_PER_SEC)
>
> Our device does not do any bookkeeping internally to prevent a
> completion event with timestamp information from arriving after 1
> second. Some internal folks have said it shouldn't be possible, but we
> did not want to take any chances and built a model that is resilient to
> timestamp deliveries after any period of time even after consuming the
> skb without appending timestamp information. If no other vendor finds
> this useful, we could roll this up into the error counter and leave the
> late counter as vendor specific. I do not want to introduce too many
> counters that are hard to understand. We document the device specific
> counters on top of introducing them in the code base already.
>
> https://docs.kernel.org/networking/device_drivers/ethernet/mellanox/mlx5/counters.html
>

We can't distinguish "late". At best we could notice if we get a
timestamp on an index that's not currently active, but we wouldn't know
for sure if it was late from a previous request or due to some other
programming error.

>>
>>> Do you have any example of a case of skipping timestamp information that
>>> is not related to lack of delivery over time? I am wondering if this
>>> case is more like a hardware error or not. Or is it more like something
>>> along the lines of being busy/would impact line rate of timestamp
>>> information must be recorded?
>>>
>>
>> The main example for skipped is the event where all our slots are full
>> at point of timestamp request.
>
> This is what I was guessing as the main (if not only reason). For this
> specific reason, I think a general "busy" stats counter makes sense.
> mlx5 does not need this counter, but I can see a lot of other hw
> implementations needing this. (The skipped counter name obviously should
> be left only in the ice driver. Just felt "busy" was easy to understand
> for generalized counters.)

Yea, I don't expect this would be required for all hardware, but it seems
like a common approach if you have limited slots for Tx timestamps
available.

>
> The reason why I prefer busy is that "skip" to me makes me think someone
> used SIOCSHWTSTAMP to filter which packets get timestamped which is very
> different from something like lack of resource availability.
>

Busy is fine with me.

>>>>
>>>> The only major addition I think is the skipped stat, which I would
>>>> prefer to have. Perhaps that could be tracked in the netdev layer by
>>>> checking whether the skb flags to see whether or not the driver actually
>>>> set the appropriate flag?
>>>
>>> I guess the problem is how would the core stack know at what layer this
>>> was skipped at (I think Kory's patch series can be used to help with
>>> this since it's adding a common interface in ethtool to select the
>>> timestamping layer). As of today, mlx5 is the only driver I know of that
>>> supports selecting between the DMA and PHY layers for timestamp
>>> information.
>>>
>>
>> Well, the way the interface worked in my understanding was that the core
>> sets the SKBTX_HW_TSTAMP flag. The driver is supposed to then prepare
>> the packet for timestamp and set the SKBTX_IN_PROGRESS flag. I just
>> looked though, and it looks like ice doesn't actually set this flag...
>
> That would be a good fix. We set this in mlx5.
>
> /* device driver is going to provide hardware time stamp */
> SKBTX_IN_PROGRESS = 1 << 2,
>

Yea. I kind of wonder how necessary it is, since we haven't been setting
it and don't seem to have an existing bug report for this. I can dig
through the kernel and see what it actually does...

>>
>> If we fixed this, in theory the stack should be able to check after the
>> packet gets sent with SKBTX_HW_TSTAMP, if SKBTX_IN_PROGRESS isn't set
>> then it would be a skipped timestamp?
>
> One question I have about this idea. Wouldn't SKBTX_IN_PROGRESS also not
> be set in the case when timestamp information is lost/a timeout occurs?
> I feel like the problem is not being able to separate these two cases
> from the perspective of the core stack.
>
> Btw, mlx5 does keep the flag even when we fail to write timestamp
> information. I feel like it might be a good idea to add a warning in
> the core stack if both SKBTX_HW_TSTAMP and SKBTX_IN_PROGRESS are set but
> the skb is consumed without skb_hwtstamps(skb) being written by the
> driver before consuming the skb.
>

I was thinking the check would happen much earlier, i.e. the moment we
exit the driver xmit routines it would check whether SKBTX_IN_PROGRESS
is set. This would be well before any actual Tx timestamp was acquired.
It's basically: "if we set SKBTX_HW_TSTAMP and you didn't set
IN_PROGRESS, then we know you didn't even start a timestamp request, so
you must have been busy".

It might not be workable, because I think the IN_PROGRESS flag is used
for another purpose. I tried to read the documentation for it in
Documentation, but I got a bit confused. I'm going to go through the code
and see what places actually check this flag.
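[Editor's note: the earlier check described above is simple enough to sketch as a userspace model. The flag values match skbuff.h; the helper name is hypothetical, and whether IN_PROGRESS can safely be repurposed this way is exactly the open question in the message.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as in the kernel's tx_flags enum in skbuff.h. */
#define SKBTX_HW_TSTAMP		(1 << 0)
#define SKBTX_IN_PROGRESS	(1 << 2)

/* Check right after the driver's xmit routine returns: the stack asked
 * for a hardware timestamp but the driver never marked one as in
 * progress, so the request was skipped (e.g. the driver was out of
 * timestamp slots and "busy").
 */
static bool ts_request_skipped(uint8_t tx_flags)
{
	return (tx_flags & SKBTX_HW_TSTAMP) && !(tx_flags & SKBTX_IN_PROGRESS);
}
```

Running this at xmit-return time, before any completion could arrive, is what would separate "skipped/busy" from "lost/late", which only become knowable later.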
On Fri, 23 Feb 2024 11:24:45 -0800 Rahul Rameshbabu wrote:
> +/**
> + * struct ethtool_ts_stats - HW timestamping statistics
> + * @layer: input field denoting whether stats should be queried from the DMA or
> + *	PHY timestamping layer. Defaults to the active layer for packet
> + *	timestamping.

I think we're better off attaching this to an existing message in the
dump (using ETHTOOL_FLAG_STATS / ethtool -I), like we do for pause, fec,
etc., rather than having to build a separate hierarchy and copy
identifiers within ETHTOOL_MSG_STATS_GET.

Sorry if I mis-directed you this way.
On Wed, 28 Feb, 2024 18:05:20 -0800 Jakub Kicinski <kuba@kernel.org> wrote:
> On Fri, 23 Feb 2024 11:24:45 -0800 Rahul Rameshbabu wrote:
>> +/**
>> + * struct ethtool_ts_stats - HW timestamping statistics
>> + * @layer: input field denoting whether stats should be queried from the DMA or
>> + *	PHY timestamping layer. Defaults to the active layer for packet
>> + *	timestamping.
>
> I think we're better off attaching this to an existing message in the
> dump (using ETHTOOL_FLAG_STATS / ethtool -I), like we do for pause,
> fec, etc., rather than having to build a separate hierarchy and copy
> identifiers within ETHTOOL_MSG_STATS_GET.
>
> Sorry if I mis-directed you this way.

No worries. I can do that in my v2.
Hi Jacob,

On Mon, 26 Feb, 2024 11:54:49 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 2/23/2024 3:43 PM, Rahul Rameshbabu wrote:
>>
>> On Fri, 23 Feb, 2024 14:48:51 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote:
>>> On 2/23/2024 2:21 PM, Rahul Rameshbabu wrote:
>>>> Do you have any example of a case of skipping timestamp information that
>>>> is not related to lack of delivery over time? I am wondering if this
>>>> case is more like a hardware error or not. Or is it more like something
>>>> along the lines of being busy/would impact line rate of timestamp
>>>> information must be recorded?
>>>>
>>>
>>> The main example for skipped is the event where all our slots are full
>>> at point of timestamp request.
>>
>> This is what I was guessing as the main (if not only reason). For this
>> specific reason, I think a general "busy" stats counter makes sense.
>> mlx5 does not need this counter, but I can see a lot of other hw
>> implementations needing this. (The skipped counter name obviously should
>> be left only in the ice driver. Just felt "busy" was easy to understand
>> for generalized counters.)
>
> Yea, I don't expect this would be required for all hardware but it seems
> like a common approach if you have limited slots for Tx timestamps
> available.
>

Sorry to bump this thread once more, but I had a question about the
Intel driver in this regard. Instead of having a busy case when all the
slots are full, would it make sense to stop the netdev queues in this
case? We actually do this in mlx5 (though keep in mind that we have a
dedicated queue just for port/phy timestamping that we start/stop).

Maybe in your case you can have a mix of HW-timestamped and
non-HW-timestamped packets in the same queue, which is why you have a
busy case? I wanted to inquire about this before sending out an RFC v2.

>>
>> The reason why I prefer busy is that "skip" to me makes me think someone
>> used SIOCSHWTSTAMP to filter which packets get timestamped which is very
>> different from something like lack of resource availability.
>>
>
> Busy is fine with me.
>

--
Thanks,
Rahul Rameshbabu
On 3/7/2024 10:47 AM, Rahul Rameshbabu wrote: > Hi Jacob, > > On Mon, 26 Feb, 2024 11:54:49 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote: >> On 2/23/2024 3:43 PM, Rahul Rameshbabu wrote: >>> >>> On Fri, 23 Feb, 2024 14:48:51 -0800 Jacob Keller <jacob.e.keller@intel.com> >>> wrote: >>>> On 2/23/2024 2:21 PM, Rahul Rameshbabu wrote: >>>>> Do you have any example of a case of skipping timestamp information that >>>>> is not related to lack of delivery over time? I am wondering if this >>>>> case is more like a hardware error or not. Or is it more like something >>>>> along the lines of being busy/would impact line rate of timestamp >>>>> information must be recorded? >>>>> >>>> >>>> The main example for skipped is the event where all our slots are full >>>> at point of timestamp request. >>> >>> This is what I was guessing as the main (if not only reason). For this >>> specific reason, I think a general "busy" stats counter makes sense. >>> mlx5 does not need this counter, but I can see a lot of other hw >>> implementations needing this. (The skipped counter name obviously should >>> be left only in the ice driver. Just felt "busy" was easy to understand >>> for generalized counters.) >> >> Yea, I don't expect this would be required for all hardware but it seems >> like a common approach if you have limited slots for Tx timestamps >> available. >> > Sorry to bump this thread once more, but I had a question regarding the > Intel driver in regards to this. Instead of having a busy case when all > the slots are full, would it make sense to stop the netdev queues in > this case, we actually do this in mlx5 (though keep in mind that we have > a dedicated queue just for port/phy timestamping that we start/stop). > > Maybe in your case, you can have a mix of HW timestamping and non-HW > timestamping in the same queue, which is why you have a busy case? > We don't use a dedicated queue. 
The issue isn't queue capacity so much
as it is the number of slots in the PHY where it can save the timestamp
data.

In practice the most common application (ptp4l) synchronously waits for
timestamps, and only has one outstanding at a time. Likely due to
limitations with original hardware that only supported one outstanding
Tx timestamp.

> Wanted to inquire about this before sending out a RFC v2.

That's actually an interesting approach, to change to a dedicated queue
which we could lock and start/stop when the indexes are full. How does
that interact with the UDP and Ethernet stacks? Presumably when you go
to transmit, you'd need to pick a queue, and if it's stopped you'd have
to drop or tell the stack?

I think I remember someone experimenting with returning NETDEV_TX_BUSY
when the slots were full, but in practice this caused a lot of issues.
None of the other devices we have with only a single slot (one set of
registers: ixgbe, i40e, igb, e1000) did that either.

If this queue model behaves in a sane way (or if we can communicate
something similar by reporting back up the stack without needing a
dedicated queue?) that could be better than the current situation.

>>>
>>> The reason why I prefer busy is that "skip" to me makes me think someone
>>> used SIOCSHWTSTAMP to filter which packets get timestamped which is very
>>> different from something like lack of resource availability.
>>>
>>
>> Busy is fine with me.
>>
>
> --
> Thanks,
>
> Rahul Rameshbabu
On Thu, 07 Mar, 2024 19:29:08 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote: > On 3/7/2024 10:47 AM, Rahul Rameshbabu wrote: >> Hi Jacob, >> >> On Mon, 26 Feb, 2024 11:54:49 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote: >>> On 2/23/2024 3:43 PM, Rahul Rameshbabu wrote: >>>> >>>> On Fri, 23 Feb, 2024 14:48:51 -0800 Jacob Keller <jacob.e.keller@intel.com> >>>> wrote: >>>>> On 2/23/2024 2:21 PM, Rahul Rameshbabu wrote: >>>>>> Do you have any example of a case of skipping timestamp information that >>>>>> is not related to lack of delivery over time? I am wondering if this >>>>>> case is more like a hardware error or not. Or is it more like something >>>>>> along the lines of being busy/would impact line rate of timestamp >>>>>> information must be recorded? >>>>>> >>>>> >>>>> The main example for skipped is the event where all our slots are full >>>>> at point of timestamp request. >>>> >>>> This is what I was guessing as the main (if not only reason). For this >>>> specific reason, I think a general "busy" stats counter makes sense. >>>> mlx5 does not need this counter, but I can see a lot of other hw >>>> implementations needing this. (The skipped counter name obviously should >>>> be left only in the ice driver. Just felt "busy" was easy to understand >>>> for generalized counters.) >>> >>> Yea, I don't expect this would be required for all hardware but it seems >>> like a common approach if you have limited slots for Tx timestamps >>> available. >>> >> Sorry to bump this thread once more, but I had a question regarding the >> Intel driver in regards to this. Instead of having a busy case when all >> the slots are full, would it make sense to stop the netdev queues in >> this case, we actually do this in mlx5 (though keep in mind that we have >> a dedicated queue just for port/phy timestamping that we start/stop). 
>>
>> Maybe in your case, you can have a mix of HW timestamping and non-HW
>> timestamping in the same queue, which is why you have a busy case?
>>
>
> We don't use a dedicated queue. The issue isn't queue capacity so much
> as it is the number of slots in the PHY for where it can save the
> timestamp data.

In mlx5, we use a dedicated queue just for the purpose of HW
timestamping because we actually do have a similar slot mechanism. We
call it metadata. We have a limit of 256 entries. We steer PTP traffic
specifically (though we will be changing this to any HW timestamped
traffic with the work Kory is doing) to this queue by matching against
the protocol and port. All other traffic goes to the normal queues that
cannot consume the timestamping slots. When all the slots are occupied,
we stop the timestamping queue rather than throwing some busy error.

>
> In practice the most common application (ptp4l) synchronously waits for
> timestamps, and only has one outstanding at a time. Likely due to
> limitations with original hardware that only supported one outstanding
> Tx timestamp.
>
>> Wanted to inquire about this before sending out a RFC v2.
>
> That's actually an interesting approach to change to a dedicated queue
> which we could lock and start/stop it when the indexes are full. How
> does that interact with the stack UDP and Ethernet stacks? Presumably
> when you go to transmit, you'd need to pick a queue and if its stopped
> you'd have to drop or tell the stack?

Let me share a pointer in mlx5 for how we do the queue selection. Like I
mentioned, we steer PTP traffic specifically, but we can change this to
just steer any skb that indicates HW timestamping.
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.h?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n71

Then, here is how we manage stopping and waking the queue (we tell the
core stack about this so we do not have to drop traffic due to some kind
of busy state because our metadata/slots are all consumed).

* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n775
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n257
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n397

>
> I think I remember someone experimenting with returning NETDEV_TX_BUSY
> when the slots were full, but in practice this caused a lot of issues.
> None of the other devices we have with only a single slot (one set of
> registers, ixgbe, i40e, igb, e1000) did that either.

We found when experimenting that even with a single slot (we had reasons
for testing this), the dedicated queue for timestamping worked out
nicely. I really would suggest investigating this model since I think it
might play out nicely for the Intel family.

>
> If this queue model behaves in a sane way (or if we can communicate
> something similar by reporting back up the stack without needing a
> dedicated queue?) that could be better than the current situation.

I personally really like the dedicated queue in the device drivers, but
if we want to instead model this slot management work in the core netdev
stack, I do not think that is a bad endeavor either (when slots are
full, HW timestamping traffic is held back till they become available).
I do think the netif_tx_wake_queue/netif_tx_stop_queue + dedicated HW
timestamping queue approach does work out nicely.

Let me know your thoughts on this. If you think it's an interesting idea
to explore, let's not add the busy counter now in this series. I already
dropped the late counter. We can add the busy counter later on if you
feel this model I have shared is not viable for Intel. I wanted to avoid
introducing too many counters pre-emptively that might not actually be
consumed widely. I had a thought that what you presented with slots is
very similar to what we have with metadata in mlx5, so I thought that
maybe handling the management of these slots in a different way, with
something like a dedicated queue for HW timestamping, could make the
design cleaner.

--
Thanks,

Rahul Rameshbabu
On 3/7/2024 9:09 PM, Rahul Rameshbabu wrote: > > On Thu, 07 Mar, 2024 19:29:08 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote: >> On 3/7/2024 10:47 AM, Rahul Rameshbabu wrote: >>> Hi Jacob, >>> >>> On Mon, 26 Feb, 2024 11:54:49 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote: >>>> On 2/23/2024 3:43 PM, Rahul Rameshbabu wrote: >>>>> >>>>> On Fri, 23 Feb, 2024 14:48:51 -0800 Jacob Keller <jacob.e.keller@intel.com> >>>>> wrote: >>>>>> On 2/23/2024 2:21 PM, Rahul Rameshbabu wrote: >>>>>>> Do you have any example of a case of skipping timestamp information that >>>>>>> is not related to lack of delivery over time? I am wondering if this >>>>>>> case is more like a hardware error or not. Or is it more like something >>>>>>> along the lines of being busy/would impact line rate of timestamp >>>>>>> information must be recorded? >>>>>>> >>>>>> >>>>>> The main example for skipped is the event where all our slots are full >>>>>> at point of timestamp request. >>>>> >>>>> This is what I was guessing as the main (if not only reason). For this >>>>> specific reason, I think a general "busy" stats counter makes sense. >>>>> mlx5 does not need this counter, but I can see a lot of other hw >>>>> implementations needing this. (The skipped counter name obviously should >>>>> be left only in the ice driver. Just felt "busy" was easy to understand >>>>> for generalized counters.) >>>> >>>> Yea, I don't expect this would be required for all hardware but it seems >>>> like a common approach if you have limited slots for Tx timestamps >>>> available. >>>> >>> Sorry to bump this thread once more, but I had a question regarding the >>> Intel driver in regards to this. Instead of having a busy case when all >>> the slots are full, would it make sense to stop the netdev queues in >>> this case, we actually do this in mlx5 (though keep in mind that we have >>> a dedicated queue just for port/phy timestamping that we start/stop). 
>>> >>> Maybe in your case, you can have a mix of HW timestamping and non-HW >>> timestamping in the same queue, which is why you have a busy case? >>> >> >> We don't use a dedicated queue. The issue isn't queue capacity so much >> as it is the number of slots in the PHY for where it can save the >> timestamp data. > > In mlx5, we use a dedicated queue just for the purpose of HW > timestamping because we actually do have a similar slot mechanism. We > call it metadata. We have a limit of 256 entries. We steer PTP traffic > specifically (though we will be changing this to any HW timestamped > traffic with the work Kory is doing) to this queue by matching against > the protocol and port. All other traffic goes to the normal queues that > cannot consume the timestamping slots. When all the slots are occupied, > we stop the timestamping queue rather than throwing some busy error. > >> >> In practice the most common application (ptp4l) synchronously waits for >> timestamps, and only has one outstanding at a time. Likely due to >> limitations with original hardware that only supported one outstanding >> Tx timestamp. >> >>> Wanted to inquire about this before sending out a RFC v2. >> >> That's actually an interesting approach to change to a dedicated queue >> which we could lock and start/stop it when the indexes are full. How >> does that interact with the stack UDP and Ethernet stacks? Presumably >> when you go to transmit, you'd need to pick a queue and if its stopped >> you'd have to drop or tell the stack? > > Let me share a pointer in mlx5 for how we do the queue selection. Like I > mentioned, we steer ptp traffic specifically, but we can change this to > just steer any skb that indicates hw timestamping. 
> > * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.h?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n71 > > Then, here is how we manage stopping and waking the queue (we tell the > core stack about this so we do not have to drop traffic due to some kind > of busy state because our metadata/slots are all consumed). > Makes sense. > * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n775 > * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n257 > * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n397 > >> >> I think I remember someone experimenting with returning NETDEV_TX_BUSY >> when the slots were full, but in practice this caused a lot of issues. >> None of the other devices we have with only a single slot (one set of >> registers, ixgbe, i40e, igb, e1000) did that either. > > So we experimented that even with a single slot (we had reasons for > testing this), the dedicated queue for timestamping worked out nicely. I > really would suggest investigating this model since I think it might > play out nicely for the Intel family. > >> >> If this queue model behaves in a sane way (or if we can communicate >> something similar by reporting back up the stack without needing a >> dedicated queue?) that could be better than the current situation. > > I personally really like the dedicated queue in the device drivers, but > if we want to instead model this slot management work in the core netdev > stack, I do not think that is a bad endeavor either (when slots are > full, hw timestamping traffic is held back till they become available). 
> I do think the netif_tx_wake_queue/netif_tx_stop_queue + dedicated HW
> timestamping queue does work out nicely.

Ok so if I understand this right, .ndo_select_queue has the stack pick a
queue, and we'd implement this to use the SKB flag. Then whenever the
slots for the queue are full we issue netif_tx_stop_queue, and whenever
the slots are released and we have slots open again we issue
netif_tx_wake_queue.

While the queue is stopped, the stack basically just buffers requests
and doesn't try to call the .ndo_start_xmit routine for that queue until
the queue is ready again?

Using a dedicated queue has some other advantages in that it could be
programmed with different priority both from the hardware side (prefer
packets waiting in the timestamp queue) and from the software side
(prioritize CPUs running the threads for processing it). That could be
useful in some applications too...

>
> Let me know your thoughts on this. If you think it's an interesting idea
> to explore, lets not add the busy counter now in this series. I already
> dropped the late counter. We can add the busy counter later on if you
> feel this model I have shared is not viable for Intel. I wanted to avoid
> introducing too many counters pre-emptively that might not actually be
> consumed widely. I had a thought that what you presented with slots is
> very similar to what we have with metadata in mlx5, so I thought that
> maybe handling the management of these slots in a different way with
> something like a dedicated queue for HW timestamping could make the
> design cleaner.
>

I think I agree with the queue model, though I'm not sure when I could
get to working on implementing this. I'm fine with dropping the busy
counter from this series.

> --
> Thanks,
>
> Rahul Rameshbabu
On Fri, 08 Mar, 2024 14:28:01 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote: > On 3/7/2024 9:09 PM, Rahul Rameshbabu wrote: >> >> On Thu, 07 Mar, 2024 19:29:08 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote: >>> On 3/7/2024 10:47 AM, Rahul Rameshbabu wrote: >>>> Hi Jacob, >>>> >>>> On Mon, 26 Feb, 2024 11:54:49 -0800 Jacob Keller <jacob.e.keller@intel.com> wrote: >>>>> On 2/23/2024 3:43 PM, Rahul Rameshbabu wrote: >>>>>> >>>>>> On Fri, 23 Feb, 2024 14:48:51 -0800 Jacob Keller <jacob.e.keller@intel.com> >>>>>> wrote: >>>>>>> On 2/23/2024 2:21 PM, Rahul Rameshbabu wrote: >>>>>>>> Do you have any example of a case of skipping timestamp information that >>>>>>>> is not related to lack of delivery over time? I am wondering if this >>>>>>>> case is more like a hardware error or not. Or is it more like something >>>>>>>> along the lines of being busy/would impact line rate of timestamp >>>>>>>> information must be recorded? >>>>>>>> >>>>>>> >>>>>>> The main example for skipped is the event where all our slots are full >>>>>>> at point of timestamp request. >>>>>> >>>>>> This is what I was guessing as the main (if not only reason). For this >>>>>> specific reason, I think a general "busy" stats counter makes sense. >>>>>> mlx5 does not need this counter, but I can see a lot of other hw >>>>>> implementations needing this. (The skipped counter name obviously should >>>>>> be left only in the ice driver. Just felt "busy" was easy to understand >>>>>> for generalized counters.) >>>>> >>>>> Yea, I don't expect this would be required for all hardware but it seems >>>>> like a common approach if you have limited slots for Tx timestamps >>>>> available. >>>>> >>>> Sorry to bump this thread once more, but I had a question regarding the >>>> Intel driver in regards to this. 
Instead of having a busy case when all >>>> the slots are full, would it make sense to stop the netdev queues in >>>> this case, we actually do this in mlx5 (though keep in mind that we have >>>> a dedicated queue just for port/phy timestamping that we start/stop). >>>> >>>> Maybe in your case, you can have a mix of HW timestamping and non-HW >>>> timestamping in the same queue, which is why you have a busy case? >>>> >>> >>> We don't use a dedicated queue. The issue isn't queue capacity so much >>> as it is the number of slots in the PHY for where it can save the >>> timestamp data. >> >> In mlx5, we use a dedicated queue just for the purpose of HW >> timestamping because we actually do have a similar slot mechanism. We >> call it metadata. We have a limit of 256 entries. We steer PTP traffic >> specifically (though we will be changing this to any HW timestamped >> traffic with the work Kory is doing) to this queue by matching against >> the protocol and port. All other traffic goes to the normal queues that >> cannot consume the timestamping slots. When all the slots are occupied, >> we stop the timestamping queue rather than throwing some busy error. >> >>> >>> In practice the most common application (ptp4l) synchronously waits for >>> timestamps, and only has one outstanding at a time. Likely due to >>> limitations with original hardware that only supported one outstanding >>> Tx timestamp. >>> >>>> Wanted to inquire about this before sending out a RFC v2. >>> >>> That's actually an interesting approach to change to a dedicated queue >>> which we could lock and start/stop it when the indexes are full. How >>> does that interact with the stack UDP and Ethernet stacks? Presumably >>> when you go to transmit, you'd need to pick a queue and if its stopped >>> you'd have to drop or tell the stack? >> >> Let me share a pointer in mlx5 for how we do the queue selection. 
Like I >> mentioned, we steer ptp traffic specifically, but we can change this to >> just steer any skb that indicates hw timestamping. >> >> * >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.h?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n71 >> >> Then, here is how we manage stopping and waking the queue (we tell the >> core stack about this so we do not have to drop traffic due to some kind >> of busy state because our metadata/slots are all consumed). >> > > Makes sense. > >> * >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n775 >> * >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n257 >> * >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n397 >> >>> >>> I think I remember someone experimenting with returning NETDEV_TX_BUSY >>> when the slots were full, but in practice this caused a lot of issues. >>> None of the other devices we have with only a single slot (one set of >>> registers, ixgbe, i40e, igb, e1000) did that either. >> >> So we experimented that even with a single slot (we had reasons for >> testing this), the dedicated queue for timestamping worked out nicely. I >> really would suggest investigating this model since I think it might >> play out nicely for the Intel family. >> >>> >>> If this queue model behaves in a sane way (or if we can communicate >>> something similar by reporting back up the stack without needing a >>> dedicated queue?) that could be better than the current situation. 
>>
>> I personally really like the dedicated queue in the device drivers, but
>> if we want to instead model this slot management work in the core netdev
>> stack, I do not think that is a bad endeavor either (when slots are
>> full, hw timestamping traffic is held back till they become available).
>> I do think the netif_tx_wake_queue/netif_tx_stop_queue + dedicated HW
>> timestamping queue does work out nicely.
>
> Ok so if I understand this right, .ndo_select_queue has the stack pick a
> queue, and we'd implement this to use the SKB flag. Then whenever the
> slots for the queue are full we issue netif_tx_stop_queue, and whenever
> the slots are released and we have slots open again we issue
> netif_tx_wake_queue..

Your summary here is correct.

>
> While the queue is stopped, the stack basically just buffers requests
> and doesn't try to call the ndo_do_xmit routine for that queue until the
> queue is ready again?

Yes, that's right. Because of this, you can avoid needing to report a
busy state and just let the stack buffer till the device backpressure
for timestamping resources is released (timestamp information is
retrieved from the device, making the slots available).

>
> Using a dedicated queue has some other advantages in that it could be
> programmed with different priority both from the hardware side (prefer
> packets waiting in the timestamp queue) and from the software side
> (prioritize CPUs running the threads for processing it). That could be
> useful in some applications too...
>
>>
>> Let me know your thoughts on this. If you think it's an interesting idea
>> to explore, lets not add the busy counter now in this series. I already
>> dropped the late counter. We can add the busy counter later on if you
>> feel this model I have shared is not viable for Intel. I wanted to avoid
>> introducing too many counters pre-emptively that might not actually be
>> consumed widely. I had a thought that what you presented with slots is
>> very similar to what we have with metadata in mlx5, so I thought that
>> maybe handling the management of these slots in a different way with
>> something like a dedicated queue for HW timestamping could make the
>> design cleaner.
>>
>
> I think I agree with the queue model, though I'm not sure when I could
> get to working on implementing this. I'm fine with dropping the busy
> counter from this series.

No worries. If you are interested in further discussion or any kind of
RFC review, I am open to this. If you feel this model is not
satisfactory in the future, we can discuss this and then look at adding
the busy counter.

--
Thanks,

Rahul Rameshbabu
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index b90c33607594..1cc1010afaca 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -483,6 +483,31 @@ struct ethtool_rmon_stats {
 	);
 };
 
+/**
+ * struct ethtool_ts_stats - HW timestamping statistics
+ * @layer: input field denoting whether stats should be queried from the DMA or
+ *	PHY timestamping layer. Defaults to the active layer for packet
+ *	timestamping.
+ * @tx_stats: struct group for TX HW timestamping
+ * @pkts: Number of packets successfully timestamped by the queried
+ *	layer.
+ * @lost: Number of packet timestamps that failed to get applied on a
+ *	packet by the queried layer.
+ * @late: Number of packet timestamps that were delivered by the
+ *	hardware but were lost due to arriving too late.
+ * @err: Number of timestamping errors that occurred on the queried
+ *	layer.
+ */
+struct ethtool_ts_stats {
+	enum ethtool_ts_stats_layer layer;
+	struct_group(tx_stats,
+		u64 pkts;
+		u64 lost;
+		u64 late;
+		u64 err;
+	);
+};
+
 #define ETH_MODULE_EEPROM_PAGE_LEN	128
 #define ETH_MODULE_MAX_I2C_ADDRESS	0x7f
 
@@ -807,6 +832,7 @@ struct ethtool_rxfh_param {
  * @get_eth_ctrl_stats: Query some of the IEEE 802.3 MAC Ctrl statistics.
  * @get_rmon_stats: Query some of the RMON (RFC 2819) statistics.
  *	Set %ranges to a pointer to zero-terminated array of byte ranges.
+ * @get_eth_ts_stats: Query the device hardware timestamping statistics.
  * @get_module_power_mode: Get the power mode policy for the plug-in module
  *	used by the network device and its operational power mode, if
  *	plugged-in.
@@ -943,6 +969,8 @@ struct ethtool_ops {
 	void	(*get_rmon_stats)(struct net_device *dev,
 				  struct ethtool_rmon_stats *rmon_stats,
 				  const struct ethtool_rmon_hist_range **ranges);
+	void	(*get_ts_stats)(struct net_device *dev,
+				struct ethtool_ts_stats *ts_stats);
 	int	(*get_module_power_mode)(struct net_device *dev,
 					 struct ethtool_module_power_mode_params *params,
 					 struct netlink_ext_ack *extack);
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 06ef6b78b7de..39edae554fc5 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -681,6 +681,7 @@ enum ethtool_link_ext_substate_module {
  * @ETH_SS_STATS_ETH_MAC: names of IEEE 802.3 MAC statistics
  * @ETH_SS_STATS_ETH_CTRL: names of IEEE 802.3 MAC Control statistics
  * @ETH_SS_STATS_RMON: names of RMON statistics
+ * @ETH_SS_STATS_TS: names of hardware timestamping statistics
  *
  * @ETH_SS_COUNT: number of defined string sets
  */
@@ -706,6 +707,7 @@ enum ethtool_stringset {
 	ETH_SS_STATS_ETH_MAC,
 	ETH_SS_STATS_ETH_CTRL,
 	ETH_SS_STATS_RMON,
+	ETH_SS_STATS_TS,
 
 	/* add new constants above here */
 	ETH_SS_COUNT
@@ -1462,6 +1464,24 @@ struct ethtool_ts_info {
 	__u32	rx_reserved[3];
 };
 
+/**
+ * enum ethtool_ts_stats_layer - layer to query hardware timestamping statistics
+ * @ETHTOOL_TS_STATS_LAYER_ACTIVE:
+ *	retrieve the statistics from the layer that is currently feeding
+ *	hardware timestamps for packets.
+ * @ETHTOOL_TS_STATS_LAYER_DMA:
+ *	retrieve the statistics from the DMA hardware timestamping layer of the
+ *	device.
+ * @ETHTOOL_TS_STATS_PHY:
+ *	retrieve the statistics from the PHY hardware timestamping layer of the
+ *	device.
+ */
+enum ethtool_ts_stats_layer {
+	ETHTOOL_TS_STATS_LAYER_ACTIVE,
+	ETHTOOL_TS_STATS_LAYER_DMA,
+	ETHTOOL_TS_STATS_LAYER_PHY,
+};
+
 /*
  * %ETHTOOL_SFEATURES changes features present in features[].valid to the
  * values of corresponding bits in features[].requested.  Bits in .requested
diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h
index 3f89074aa06c..55f2a3c8caa0 100644
--- a/include/uapi/linux/ethtool_netlink.h
+++ b/include/uapi/linux/ethtool_netlink.h
@@ -749,6 +749,7 @@ enum {
 	ETHTOOL_A_STATS_GRP,			/* nest - _A_STATS_GRP_* */
 	ETHTOOL_A_STATS_SRC,			/* u32 */
+	ETHTOOL_A_STATS_LAYER,			/* u32 */
 
 	/* add new constants above here */
 	__ETHTOOL_A_STATS_CNT,
@@ -760,6 +761,7 @@ enum {
 	ETHTOOL_STATS_ETH_MAC,
 	ETHTOOL_STATS_ETH_CTRL,
 	ETHTOOL_STATS_RMON,
+	ETHTOOL_STATS_TS,
 
 	/* add new constants above here */
 	__ETHTOOL_STATS_CNT
@@ -875,6 +877,21 @@ enum {
 	ETHTOOL_A_STATS_RMON_MAX = (__ETHTOOL_A_STATS_RMON_CNT - 1)
 };
 
+enum {
+	/* hwTimestampingTxPkts */
+	ETHTOOL_A_STATS_TS_TX_PKT,
+	/* hwTimestampingTxLost */
+	ETHTOOL_A_STATS_TS_TX_LOST,
+	/* hwTimestampingTxLate */
+	ETHTOOL_A_STATS_TS_TX_LATE,
+	/* hwTimestampingTxErrors */
+	ETHTOOL_A_STATS_TS_TX_ERRORS,
+
+	/* add new constants above here */
+	__ETHTOOL_A_STATS_TS_CNT,
+	ETHTOOL_A_STATS_TS_MAX = (__ETHTOOL_A_STATS_TS_CNT - 1)
+};
+
 /* MODULE */
 enum {
diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h
index 9a333a8d04c1..962ecd62aeea 100644
--- a/net/ethtool/netlink.h
+++ b/net/ethtool/netlink.h
@@ -429,7 +429,7 @@ extern const struct nla_policy ethnl_tunnel_info_get_policy[ETHTOOL_A_TUNNEL_INF
 extern const struct nla_policy ethnl_fec_get_policy[ETHTOOL_A_FEC_HEADER + 1];
 extern const struct nla_policy ethnl_fec_set_policy[ETHTOOL_A_FEC_AUTO + 1];
 extern const struct nla_policy ethnl_module_eeprom_get_policy[ETHTOOL_A_MODULE_EEPROM_I2C_ADDRESS + 1];
-extern const struct nla_policy ethnl_stats_get_policy[ETHTOOL_A_STATS_SRC + 1];
+extern const struct nla_policy ethnl_stats_get_policy[__ETHTOOL_A_STATS_CNT];
 extern const struct nla_policy ethnl_phc_vclocks_get_policy[ETHTOOL_A_PHC_VCLOCKS_HEADER + 1];
 extern const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1];
 extern const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_POWER_MODE_POLICY + 1];
@@ -454,5 +454,6 @@ extern const char stats_eth_phy_names[__ETHTOOL_A_STATS_ETH_PHY_CNT][ETH_GSTRING
 extern const char stats_eth_mac_names[__ETHTOOL_A_STATS_ETH_MAC_CNT][ETH_GSTRING_LEN];
 extern const char stats_eth_ctrl_names[__ETHTOOL_A_STATS_ETH_CTRL_CNT][ETH_GSTRING_LEN];
 extern const char stats_rmon_names[__ETHTOOL_A_STATS_RMON_CNT][ETH_GSTRING_LEN];
+extern const char stats_ts_names[__ETHTOOL_A_STATS_TS_CNT][ETH_GSTRING_LEN];
 
 #endif /* _NET_ETHTOOL_NETLINK_H */
diff --git a/net/ethtool/stats.c b/net/ethtool/stats.c
index 912f0c4fff2f..e4333d77fb31 100644
--- a/net/ethtool/stats.c
+++ b/net/ethtool/stats.c
@@ -8,6 +8,7 @@ struct stats_req_info {
 	struct ethnl_req_info		base;
 	DECLARE_BITMAP(stat_mask, __ETHTOOL_STATS_CNT);
 	enum ethtool_mac_stats_src	src;
+	enum ethtool_ts_stats_layer	layer;
 };
 
 #define STATS_REQINFO(__req_base) \
@@ -20,6 +21,7 @@ struct stats_reply_data {
 		struct ethtool_eth_mac_stats	mac_stats;
 		struct ethtool_eth_ctrl_stats	ctrl_stats;
 		struct ethtool_rmon_stats	rmon_stats;
+		struct ethtool_ts_stats		ts_stats;
 	);
 	const struct ethtool_rmon_hist_range	*rmon_ranges;
 };
@@ -32,6 +34,7 @@ const char stats_std_names[__ETHTOOL_STATS_CNT][ETH_GSTRING_LEN] = {
 	[ETHTOOL_STATS_ETH_MAC] = "eth-mac",
 	[ETHTOOL_STATS_ETH_CTRL] = "eth-ctrl",
 	[ETHTOOL_STATS_RMON] = "rmon",
+	[ETHTOOL_STATS_TS] = "ts",
 };
 
 const char stats_eth_phy_names[__ETHTOOL_A_STATS_ETH_PHY_CNT][ETH_GSTRING_LEN] = {
@@ -76,18 +79,27 @@ const char stats_rmon_names[__ETHTOOL_A_STATS_RMON_CNT][ETH_GSTRING_LEN] = {
 	[ETHTOOL_A_STATS_RMON_JABBER] = "etherStatsJabbers",
 };
 
-const struct nla_policy ethnl_stats_get_policy[ETHTOOL_A_STATS_SRC + 1] = {
-	[ETHTOOL_A_STATS_HEADER] =
-		NLA_POLICY_NESTED(ethnl_header_policy),
-	[ETHTOOL_A_STATS_GROUPS] = { .type = NLA_NESTED },
-	[ETHTOOL_A_STATS_SRC] =
+const char stats_ts_names[__ETHTOOL_A_STATS_TS_CNT][ETH_GSTRING_LEN] = {
+	[ETHTOOL_A_STATS_TS_TX_PKT] = "hwTimestampingTxPkts",
+	[ETHTOOL_A_STATS_TS_TX_LOST] = "hwTimestampingTxLost",
+	[ETHTOOL_A_STATS_TS_TX_LATE] = "hwTimestampingTxLate",
+	[ETHTOOL_A_STATS_TS_TX_ERRORS] = "hwTimestampingTxErrors",
+};
+
+const struct nla_policy ethnl_stats_get_policy[__ETHTOOL_A_STATS_CNT] = {
+	[ETHTOOL_A_STATS_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
+	[ETHTOOL_A_STATS_GROUPS] = { .type = NLA_NESTED },
+	[ETHTOOL_A_STATS_SRC] =
 		NLA_POLICY_MAX(NLA_U32, ETHTOOL_MAC_STATS_SRC_PMAC),
+	[ETHTOOL_A_STATS_LAYER] =
+		NLA_POLICY_MAX(NLA_U32, ETHTOOL_TS_STATS_LAYER_PHY),
 };
 
 static int stats_parse_request(struct ethnl_req_info *req_base,
 			       struct nlattr **tb,
 			       struct netlink_ext_ack *extack)
 {
+	enum ethtool_ts_stats_layer layer = ETHTOOL_TS_STATS_LAYER_ACTIVE;
 	enum ethtool_mac_stats_src src = ETHTOOL_MAC_STATS_SRC_AGGREGATE;
 	struct stats_req_info *req_info = STATS_REQINFO(req_base);
 	bool mod = false;
@@ -104,9 +116,12 @@ static int stats_parse_request(struct ethnl_req_info *req_base,
 		return -EINVAL;
 	}
 
+	if (tb[ETHTOOL_A_STATS_LAYER])
+		layer = nla_get_u32(tb[ETHTOOL_A_STATS_LAYER]);
 	if (tb[ETHTOOL_A_STATS_SRC])
 		src = nla_get_u32(tb[ETHTOOL_A_STATS_SRC]);
 
+	req_info->layer = layer;
 	req_info->src = src;
 
 	return 0;
@@ -118,6 +133,7 @@ static int stats_prepare_data(const struct ethnl_req_info *req_base,
 {
 	const struct stats_req_info *req_info = STATS_REQINFO(req_base);
 	struct stats_reply_data *data = STATS_REPDATA(reply_base);
+	enum ethtool_ts_stats_layer layer = req_info->layer;
 	enum ethtool_mac_stats_src src = req_info->src;
 	struct net_device *dev = reply_base->dev;
 	int ret;
@@ -144,6 +160,7 @@ static int stats_prepare_data(const struct ethnl_req_info *req_base,
 	data->mac_stats.src = src;
 	data->ctrl_stats.src = src;
 	data->rmon_stats.src = src;
+	data->ts_stats.layer = layer;
 
 	if (test_bit(ETHTOOL_STATS_ETH_PHY, req_info->stat_mask) &&
 	    dev->ethtool_ops->get_eth_phy_stats)
@@ -158,6 +175,9 @@ static int stats_prepare_data(const struct ethnl_req_info *req_base,
 	    dev->ethtool_ops->get_rmon_stats)
 		dev->ethtool_ops->get_rmon_stats(dev, &data->rmon_stats,
 						 &data->rmon_ranges);
+	if (test_bit(ETHTOOL_STATS_TS, req_info->stat_mask) &&
+	    dev->ethtool_ops->get_ts_stats)
+		dev->ethtool_ops->get_ts_stats(dev, &data->ts_stats);
 
 	ethnl_ops_complete(dev);
 	return 0;
@@ -194,6 +214,10 @@ static int stats_reply_size(const struct ethnl_req_info *req_base,
 			nla_total_size(4))	* /* _A_STATS_GRP_HIST_BKT_HI */
 			ETHTOOL_RMON_HIST_MAX * 2;
 	}
+	if (test_bit(ETHTOOL_STATS_TS, req_info->stat_mask)) {
+		n_stats += sizeof(struct ethtool_ts_stats) / sizeof(u64);
+		n_grps++;
+	}
 
 	len += n_grps * (nla_total_size(0) + /* _A_STATS_GRP */
 			 nla_total_size(4) + /* _A_STATS_GRP_ID */
@@ -370,6 +394,22 @@ static int stats_put_rmon_stats(struct sk_buff *skb,
 	return 0;
 }
 
+static int stats_put_ts_stats(struct sk_buff *skb,
+			      const struct stats_reply_data *data)
+{
+	if (stat_put(skb, ETHTOOL_A_STATS_TS_TX_PKT,
+		     data->ts_stats.pkts) ||
+	    stat_put(skb, ETHTOOL_A_STATS_TS_TX_LOST,
+		     data->ts_stats.lost) ||
+	    stat_put(skb, ETHTOOL_A_STATS_TS_TX_LATE,
+		     data->ts_stats.late) ||
+	    stat_put(skb, ETHTOOL_A_STATS_TS_TX_ERRORS,
+		     data->ts_stats.err))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
 static int stats_put_stats(struct sk_buff *skb,
 			   const struct stats_reply_data *data,
 			   u32 id, u32 ss_id,
@@ -423,6 +463,9 @@ static int stats_fill_reply(struct sk_buff *skb,
 	if (!ret && test_bit(ETHTOOL_STATS_RMON, req_info->stat_mask))
 		ret = stats_put_stats(skb, data, ETHTOOL_STATS_RMON,
 				      ETH_SS_STATS_RMON, stats_put_rmon_stats);
+	if (!ret && test_bit(ETHTOOL_STATS_TS, req_info->stat_mask))
+		ret = stats_put_stats(skb, data, ETHTOOL_STATS_TS,
+				      ETH_SS_STATS_TS, stats_put_ts_stats);
 
 	return ret;
 }
diff --git a/net/ethtool/strset.c b/net/ethtool/strset.c
index c678b484a079..ce1e193076c3 100644
--- a/net/ethtool/strset.c
+++ b/net/ethtool/strset.c
@@ -105,6 +105,11 @@ static const struct strset_info info_template[] = {
 		.count		= __ETHTOOL_A_STATS_RMON_CNT,
 		.strings	= stats_rmon_names,
 	},
+	[ETH_SS_STATS_TS] = {
+		.per_dev	= false,
+ .count = __ETHTOOL_A_STATS_TS_CNT, + .strings = stats_ts_names, + }, }; struct strset_req_info {