Metric cardinality #43

Open
SamBarker opened this issue Oct 2, 2023 · 3 comments

Comments

@SamBarker

Now that I see tags like this, it makes me wonder about the metric cardinality we have going on here. It's fine for smaller clusters, but it grows like $numBrokers^2$, so a 60-broker cluster would have 3,600 metrics for each of these.

Originally posted by @tombentley in #42 (comment)
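
For illustration, a minimal sketch of the arithmetic behind that quadratic growth (assuming, as above, that each broker publishes one tagged series per metric for every broker it observes):

```java
public class CardinalityEstimate {
    public static void main(String[] args) {
        // Each broker publishes roughly one tagged series per metric for every
        // observed broker, so the cluster-wide series count grows quadratically.
        for (int numBrokers : new int[] { 3, 12, 60 }) {
            long seriesPerMetric = (long) numBrokers * numBrokers;
            System.out.printf("%d brokers -> ~%d series per metric%n", numBrokers, seriesPerMetric);
        }
    }
}
```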

@robobario

I'm trying to imagine who/what these metrics would be used for.

If it's for the user to monitor how close they are to a violation, we could expose gauges for smallestAvailableLogdirBytes and smallestAvailableLogdirPercent to make it visible how close we are to a violation, given that the logic kicks in if any logdir violates the policy. There is of course less information for debugging which logdir was observed.
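
A minimal sketch of that aggregation, assuming per-logdir observations are available in the plugin (the LogDirObservation type and its field names are hypothetical, not the plugin's actual API):

```java
import java.util.List;

// Hypothetical per-logdir observation; the field names are assumptions for illustration.
record LogDirObservation(String path, long availableBytes, long totalBytes) {}

class LogDirAggregates {
    // Value for a smallestAvailableLogdirBytes gauge: minimum free space across all logdirs.
    static long smallestAvailableLogdirBytes(List<LogDirObservation> observations) {
        return observations.stream()
                .mapToLong(LogDirObservation::availableBytes)
                .min()
                .orElse(Long.MAX_VALUE); // no observations yet -> nothing close to a violation
    }

    // Value for a smallestAvailableLogdirPercent gauge: minimum free percentage across all logdirs.
    static double smallestAvailableLogdirPercent(List<LogDirObservation> observations) {
        return observations.stream()
                .mapToDouble(o -> 100.0 * o.availableBytes() / o.totalBytes())
                .min()
                .orElse(100.0);
    }
}
```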

Another case might be debugging why the plugin has decided to throttle: the throttling has kicked in and the user is trying to trace why and what was observed. Maybe this is better handled with logging. When we decide to change the factor, we could log more about the observation, like throttleFactor changing to XYZ, triggered by observation [...] which produced limit violations [...]. These state changes are likely what the user cares about in this disaster case.
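
A minimal sketch of that kind of state-change log line, using SLF4J (the LimitViolation type, its fields and the method here are assumptions for illustration, not the plugin's actual code):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;

class ThrottleFactorLogger {
    private static final Logger LOGGER = LoggerFactory.getLogger(ThrottleFactorLogger.class);

    // Hypothetical shape of the limit violations produced by an observation.
    record LimitViolation(int brokerId, String logDir, long availableBytes) {}

    // Log only on state changes, so the user can trace why throttling kicked in.
    static void logFactorChange(double oldFactor, double newFactor, List<LimitViolation> violations) {
        if (oldFactor != newFactor) {
            LOGGER.info("throttleFactor changing from {} to {}, triggered by limit violations {}",
                    oldFactor, newFactor, violations);
        }
    }
}
```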

@tombentley
Member

tombentley commented Oct 4, 2023

Do we need the metrics from the plugin in all 59 brokers when 1 broker has some issue? Is it enough to know that a subset of the brokers all agree that that 1 broker has a problem? If so, then we could construct some expression involving broker ids, the cluster size and the number of observations we do want, and use that to drive which brokers publish metrics about which other brokers. Something like x != y && 0 < (y-x) && (y-x) < 3, where x and y are broker ids?
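
A minimal sketch of that idea, using the example expression above (the exact expression is illustrative only):

```java
class MetricPublicationPolicy {
    // Decide whether broker x should publish tagged metrics about broker y, using the
    // example expression from the comment above: each broker only reports on the next
    // couple of broker ids, which caps the per-broker fan-out.
    static boolean shouldPublishAbout(int x, int y) {
        return x != y && 0 < (y - x) && (y - x) < 3;
    }
}
```

With this particular expression each broker reports on at most two peers, so the cluster-wide series count grows linearly with cluster size rather than quadratically (though as written, brokers at the edges of the id range observe or are observed by fewer peers, so a real rule would probably need to wrap around the id space).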

@SamBarker
Author

Yes, the cardinality explosion is potentially problematic; however, there are a few separate mitigations here.

  1. Users have to opt into using the plug-in. For users operating large-scale clusters, I'm not convinced this plug-in is the right shape to begin with. Do you really want to block access to all 60 nodes when three of them are running out of disk space?
  2. Users have to opt into consuming the metrics and pushing them to their metrics infrastructure. They can filter at ingestion or use very short retention times to manage this.
  3. Users running large Kafka clusters already have large numbers of metrics to contend with; are we making the problem materially worse?
  4. Is this actually an issue for people with large clusters?
  5. Switching to logging only makes life harder for debugging issues, and as things scale it only gets worse.

In small clusters I think we do want the full breakdown of metrics so that users can identify split brain/connectivity issues or issues in the plug-in.

Potential mitigations would be:

  1. Make tagged metrics opt-in or opt-out (not sure which would make more sense). If externally co-ordinated, this could mean it's only enabled on representative subsets of nodes (e.g. once per rack, on the assumption that inter-rack connectivity is much less reliable than intra-rack connectivity).
  2. Assume intra-rack connectivity is reliable and only add tagged metrics for brokers in differing racks.
  3. Add a regex to only add tagged metrics where broker addresses match the pattern (see the sketch after this list).
  4. Stop having tagged metrics entirely.
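
As a sketch of mitigation 3, assuming a configurable regular expression matched against the observed broker's address (the configuration property and class names here are hypothetical):

```java
import java.util.regex.Pattern;

class TaggedMetricFilter {
    private final Pattern brokerAddressPattern;

    // The pattern would come from plugin configuration, e.g. a hypothetical
    // property such as "tagged.metrics.broker.address.pattern".
    TaggedMetricFilter(String configuredRegex) {
        this.brokerAddressPattern = Pattern.compile(configuredRegex);
    }

    // Only emit per-broker tagged metrics for broker addresses matching the pattern.
    boolean shouldTag(String brokerAddress) {
        return brokerAddressPattern.matcher(brokerAddress).matches();
    }
}
```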

Compromising the experience for the majority of users, just in case we might inconvenience exceptional users, seems like the wrong trade-off to me. So I suggest we close this for now and revisit if it actually proves to be an issue for users.
