Add a new rule of thumb #161

Open · Expertium opened this issue Jan 15, 2024 · 2 comments
Labels: enhancement (New feature or request)

Expertium commented Jan 15, 2024

There is a rule of thumb that should, in theory, perform better than Silverman's rule. Here is the relevant paper: https://www.hindawi.com/journals/jps/2015/242683/
And here is my simple Python implementation for one-dimensional data:

import numpy as np

def chens_rule(data):
    # Chen's rule of thumb: https://www.hindawi.com/journals/jps/2015/242683/
    std = np.std(data)
    # Normal-consistent IQR-based scale: IQR / (Phi^-1(0.75) - Phi^-1(0.25))
    IQR = (np.percentile(data, q=75) - np.percentile(data, q=25)) / 1.3489795003921634
    scale = min(IQR, std)  # robust scale estimate, as in Silverman's rule
    mean = np.mean(data)
    n = len(data)
    if mean != 0 and scale > 0:
        cv = (1 + 1 / (4 * n)) * scale / mean  # coefficient of variation, corrected for small sample size
        h = ((4 * (2 + cv ** 2)) ** (1 / 5)) * scale * (n ** (-2 / 5))
        return h
    else:
        raise ValueError("Chen's rule failed: zero mean or zero scale")

Note that I made two changes compared to the original paper:

  1. The estimate of scale is not simply the standard deviation: I changed it to the minimum of the standard deviation and the normalized IQR to make it more robust, similar to Silverman's rule.
  2. I added a small-sample correction to the coefficient of variation. However, it is only appropriate for normally distributed data, so I'm not entirely sure whether it should be used (a quick sketch of both changes follows below).
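
For illustration, here is a quick sketch of both changes (not from the paper; the sample is arbitrary). Change 1 keeps the scale estimate stable when there is an outlier, and change 2 is a multiplicative factor that only matters for small n:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=200)
data[0] = 50.0  # inject a single outlier

std = np.std(data)  # inflated by the outlier
iqr_scale = (np.percentile(data, 75) - np.percentile(data, 25)) / 1.3489795003921634
print(std, iqr_scale, min(std, iqr_scale))  # min() discards the inflated std
print(1 + 1 / (4 * len(data)))  # correction factor: 1.00125, negligible for n = 200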

tommyod (Owner) commented Jan 15, 2024

Hi, thanks for bringing this to my attention.

However, if you want this to go into KDEpy, you will have to write it and open a PR for it. I'm open to it if it seems to perform well on data. I would want to see it compared against Silverman on a plot.

tommyod added the enhancement label Jan 15, 2024

Expertium (Author) commented Jan 18, 2024

Unfortunately, I'm not that good at coding; making a fully fleshed-out version of this function is beyond my abilities. The best I can do is add support for weighting, at least for one-dimensional data:

import numpy as np

def chen_rule(data, weights=None):
    # Chen's rule of thumb: https://www.hindawi.com/journals/jps/2015/242683/
    data = np.asarray(data)
    if weights is None:
        weights = np.ones_like(data, dtype=float)
    weights = np.asarray(weights, dtype=float)

    def weighted_percentile(data, weights, q):
        # q must be between 0 and 1
        ix = np.argsort(data)
        data = data[ix]  # sort data
        weights = weights[ix]  # sort weights accordingly
        C = 1
        cdf = (np.cumsum(weights) - C * weights) / (np.sum(weights) + (1 - 2 * C) * weights)  # 'like' a CDF function
        # when all weights are equal to 1, this is equivalent to using 'linear' in np.percentile
        return np.interp(q, cdf, data)

    std = float(np.sqrt(np.cov(data, aweights=weights)))
    IQR = (weighted_percentile(data, weights, 0.75) - weighted_percentile(data, weights, 0.25)) / 1.3489795003921634
    scale = min(IQR, std)  # robust scale estimate, as in Silverman's rule
    mean = np.average(data, weights=weights)
    n = len(data)
    if mean != 0 and scale > 0:
        cv = (1 + 1 / (4 * n)) * scale / mean  # coefficient of variation, corrected for small sample size
        h = ((4 * (2 + cv ** 2)) ** (1 / 5)) * scale * (n ** (-2 / 5))
        return h
    else:
        raise ValueError("Chen's rule failed: zero mean or zero scale")
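
A quick usage sketch (the weights here are arbitrary, just for illustration); note that rescaling all weights by a constant leaves the bandwidth unchanged:

import numpy as np

rng = np.random.default_rng(1)
data = rng.lognormal(sigma=0.8, size=500)
weights = rng.uniform(0.5, 1.5, size=500)  # arbitrary per-observation weights

print(chen_rule(data))               # unweighted
print(chen_rule(data, weights))      # weighted
print(chen_rule(data, 2 * weights))  # same as above: only relative weights matter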

As for plotting, here's a comparison using the standard normal distribution:
[Figure_1: Silverman, Chen, and ISJ density estimates vs. the true standard normal density]

Just by eyeballing it, Chen seems to be a little closer to the "true" distribution than Silverman, although maybe I just got lucky with this sample. Improved Sheather-Jones doesn't perform well: the line is very "squiggly" and not smooth. The issue with Chen's rule is that cv = (1 + 1/(4 * n)) * scale / mean approaches infinity as the mean approaches 0.
However, I think it's more interesting to compare them on a distribution that is heavily skewed:
[Figure_2: the same comparison on a heavily skewed distribution]

Chen is in between Silverman and ISJ in terms of smoothness. I think this part demonstrates it especially well:
[image: close-up of a region of Figure_2]
Silverman (orange) is the smoothest, Chen (blue) is more squiggly, and ISJ (green) is the most squiggly one. Obviously, this isn't a rigorous analysis.
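
If anyone wants to reproduce this kind of comparison, here is a minimal sketch using KDEpy's FFTKDE; I haven't included the exact data behind the figures, so the lognormal sample below is just a stand-in for a heavily skewed distribution:

import numpy as np
import matplotlib.pyplot as plt
from KDEpy import FFTKDE

rng = np.random.default_rng(42)
data = rng.lognormal(sigma=0.8, size=1000)

# FFTKDE accepts a named rule ("silverman", "ISJ") or a numeric bandwidth,
# so Chen's rule can be plugged in as a plain float
for bw, label in [("silverman", "Silverman"), ("ISJ", "ISJ"), (chen_rule(data), "Chen")]:
    x, y = FFTKDE(kernel="gaussian", bw=bw).fit(data).evaluate()
    plt.plot(x, y, label=label)

plt.hist(data, bins=100, density=True, alpha=0.3, label="sample")
plt.legend()
plt.show()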
