Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inconsistent behavior when using custom grid #15

Closed
blasern opened this issue Feb 18, 2019 · 4 comments · Fixed by #16
Closed

inconsistent behavior when using custom grid #15

blasern opened this issue Feb 18, 2019 · 4 comments · Fixed by #16
Assignees

Comments

@blasern
Copy link

blasern commented Feb 18, 2019

FFTKDE seems to change the grid internally when providing a custom grid in 2d. Results do not seem to match with the results from NaiveKDE or TreeKDE. See below for an example

# imports
import numpy as np
import matplotlib.pyplot as plt
import KDEpy
%matplotlib inline

# Create bimodal 2D data
data = np.vstack((np.random.randn(2**8, 2), 
                  np.random.randn(2**8, 2) + (0, 5)))

# Create 2D grid
grid_size = 20
min_X = np.min(data, axis=0)
max_X = np.max(data, axis=0)
grid_margins = tuple(np.linspace(mn, mx, grid_size)
                     for mn, mx in zip(min_X, max_X))
grid = np.stack(np.meshgrid(*grid_margins), -1).reshape(-1, len(grid_margins))

# density estimates
y1 = KDEpy.FFTKDE(bw=0.2).fit(data).evaluate(grid)
y2 = KDEpy.TreeKDE(bw=0.2).fit(data).evaluate(grid)
y3 = KDEpy.NaiveKDE(bw=0.2).fit(data).evaluate(grid)

# plots
fig, axes = plt.subplots(2, 2,  figsize = (10, 10))
axes[0, 0].set_title('Data')
axes[0, 0].scatter(data[:, 0], data[:, 1])
axes[0, 0].set_title('FFTKDE')
axes[0, 1].scatter(grid[:, 0], grid[:, 1], c=y1)
axes[0, 0].set_title('TreeKDE')
axes[1, 0].scatter(grid[:, 0], grid[:, 1], c=y2)
axes[0, 0].set_title('Naive')
axes[1, 1].scatter(grid[:, 0], grid[:, 1], c=y3)
@tommyod tommyod self-assigned this Feb 18, 2019
@tommyod
Copy link
Owner

tommyod commented Feb 18, 2019

Thanks for raising the Issue @blasern . I agree that something is wrong here. I hope to look into this bug very soon, this weekend or the next. I will report my findings here and patch it up with a regression test.

@tommyod
Copy link
Owner

tommyod commented Feb 19, 2019

I believe I've found the problem.

Notice first that the following code produces incorrect results

# imports
import numpy as np
import matplotlib.pyplot as plt
import KDEpy

# Create bimodal 2D data
data = np.array([[-1, 1], [0, 1], [1, 1], [-1, -1], [0, -1], [1, -1]])

# Create 2D grid
grid_x = np.linspace(-2, 2, 2 ** 5)
grid_y = np.linspace(-2, 2, 2 ** 4)
grid = np.stack(np.meshgrid(grid_x, grid_y), -1).reshape(-1, 2)

# density estimates
y1 = KDEpy.FFTKDE(bw=0.2).fit(data).evaluate(grid)
y2 = KDEpy.TreeKDE(bw=0.2).fit(data).evaluate(grid)
y3 = KDEpy.NaiveKDE(bw=0.2).fit(data).evaluate(grid)

# plots
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
axes[0, 0].set_title("Data")
axes[0, 0].scatter(data[:, 0], data[:, 1], marker="x", color="red")

axes[0, 1].set_title("FFTKDE")
axes[0, 1].scatter(grid[:, 0], grid[:, 1], c=y1)
axes[0, 1].scatter(data[:, 0], data[:, 1], marker="x", color="red")

axes[1, 0].set_title("TreeKDE")
axes[1, 0].scatter(grid[:, 0], grid[:, 1], c=y2)
axes[1, 0].scatter(data[:, 0], data[:, 1], marker="x", color="red")

axes[1, 1].set_title("Naive")
axes[1, 1].scatter(grid[:, 0], grid[:, 1], c=y3)
axes[1, 1].scatter(data[:, 0], data[:, 1], marker="x", color="red")

print(np.mean(np.abs(y2 - y3)))  # 2.3899e-05
print(np.mean(np.abs(y1 - y3)))  # 0.0945866
print(np.mean(np.abs(y1 - y2)))  # 0.0945963

image

Changing grid creation slightly produces another error

grid = np.stack(np.meshgrid(grid_y, grid_x), -1).reshape(-1, 2)

image

Everything works with the following grid

grid = np.stack(np.meshgrid(grid_y, grid_x), -1).reshape(-1, 2)
grid[:, [0, 1]] = grid[:, [1, 0]] # Swap indices

image

Explanation

The last code works because the grid values are sorted in order of dimensionality. I.e. it's sorted by x_1, then by x_2 in this case. It looks like the following (simplified version):

x_1 x_2
-1 -1
-1 0
-1 1
0 -1
0 0
0 1
1 -1
1 0
1 1

It's required that the grid has this structure to achieve fast linear binning. If I remember correctly: the algorithm must know where to place a data point in the grid in constant time, so it must be able to compute the index immediately. This requires the above grid structure. The (somewhat cryptic) lines of code which perform this is found here:

# (1) To compute the index of this corner, and the value, we must

Solution

We should verify that user-defined grids obey the above structure, i.e. that they are sorted by x_1, x_2, ... ,x_D. With N grid points in each of the D dimensions, there are N^D grid points in total. I believe sorting would require O(N^D log (N^D)) operations. Querying whether or not the grid is sorted by the above key requires O(N^D * D) operations, since for each of the N^D grid points D operations are needed.

I propose to add such a check to user-specified grids. I will also perform some more testing. Thoughts on this @blasern ?

@blasern
Copy link
Author

blasern commented Feb 19, 2019

I still find it slightly inconvenient that I cannot choose the grid I want. As I've said before, I may not even want to use an axis-aligned grid.

On the other hand, as long as this requirement is documented and you check the input grids (perhaps with an option to turn this check off, for speed) I think that's fine.

@tommyod
Copy link
Owner

tommyod commented Feb 19, 2019

Great. 👍

I agree that it's inconvenient to have to use an axis-aligned grid, the discussion about non-axis aligned (yet equally spaced) fast algorithms could continue at Issue #6 . Arbitrary grids are possible using the other algorithms.

Thanks again for your valuable input. I'll leave this issue open until I've patched the software.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants