
Avoid NaN (co)tangents for sqrt(0) #599

Open · wants to merge 2 commits into main
Conversation

sethaxen (Member)
This PR fixes #576 by treating zero (co)tangents in sqrt as strong zeros.

It also partially fixes FluxML/Zygote.jl#1101, but fixing that entirely would require the same change to the rule for ^.
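As an AD-free sketch of the mechanism (plain Julia; `naive_pullback` and `guarded_pullback` are illustrative names, not the PR's code): sqrt's pullback divides the cotangent by 2·sqrt(x), which is 0/0 when both the cotangent and the primal input are zero, and the strong-zero guard short-circuits exactly that case.

```julia
# Why sqrt's pullback produces NaN at zero: the cotangent ΔΩ is divided
# by 2*sqrt(x), which is 0/0 when both ΔΩ and x are zero.
naive_pullback(ΔΩ, x) = ΔΩ / (2 * sqrt(x))

# Treating a zero cotangent at a zero primal as a strong zero
# short-circuits that division.
function guarded_pullback(ΔΩ, x)
    ∂x = ΔΩ / (2 * sqrt(x))
    return ifelse(iszero(ΔΩ) & iszero(x), zero(∂x), ∂x)
end

naive_pullback(0.0, 0.0)    # NaN
guarded_pullback(0.0, 0.0)  # 0.0
```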

Benchmark

This simple benchmark indicates that the performance decrease from this modified rule in Zygote is not extreme.

julia> using Zygote, BenchmarkTools, Random

julia> x = zeros(1_000);

julia> y = rand(MersenneTwister(42), 1_000);

julia> f(x) = sum(x -> max(sqrt(x), 1), x)
f (generic function with 1 method)

julia> Zygote.gradient(f, x)
([NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN  …  NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN],)

julia> b1_1 = @benchmark $(Zygote.gradient)($f, $x)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   8.663 μs … 730.817 μs  ┊ GC (min … max):  0.00% … 93.96%
 Time  (median):      9.755 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   13.060 μs ±  31.542 μs  ┊ GC (mean ± σ):  15.00% ±  6.24%

  ▅██▆▇▆▅▄▃▂▁▂▂▂▂▂▁▁▁▁▁▂▂▂▁▁                       ▁▁▁         ▂
  ███████████████████████████▇█▇▇▇▆▇▇▆▆▄▆▆▆▆▇█▇▇▆▆▆████▇▆▅▄▇▆▆ █
  8.66 μs       Histogram: log(frequency) by time        26 μs <

 Memory estimate: 71.22 KiB, allocs estimate: 31.

julia> b2_1 = @benchmark $(Zygote.gradient)($f, $y)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   8.612 μs … 532.969 μs  ┊ GC (min … max):  0.00% … 93.85%
 Time  (median):      9.335 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   12.282 μs ±  30.078 μs  ┊ GC (mean ± σ):  16.07% ±  6.44%

  ▅██▆▆▆▅▃▂ ▁  ▂▂▂▂▁▁     ▁                                    ▂
  ██████████████████████████████▇▇▆▆▅▆▆▄▇▅▄▅▅▆▅▅▆▅▅▆▆▆▆▅▅▅▄▄▅▅ █
  8.61 μs       Histogram: log(frequency) by time      23.9 μs <

 Memory estimate: 71.22 KiB, allocs estimate: 31.

julia> using ChainRulesCore

julia> function ChainRulesCore.frule((_, Δx), ::typeof(sqrt), x::Number)
           Ω = sqrt(x)
           ∂Ω = Δx / 2Ω
           return Ω, ifelse(iszero(Δx) & iszero(x), zero(∂Ω), ∂Ω)
       end

julia> function ChainRulesCore.rrule(::typeof(sqrt), x::Number)
           Ω = sqrt(x)
           function sqrt_pullback(ΔΩ)
               ∂x = ΔΩ / 2conj(Ω)
               return (
                   NoTangent(),
                   ProjectTo(x)(ifelse(iszero(ΔΩ) & iszero(x), zero(∂x), ∂x))
               )
           end
           return Ω, sqrt_pullback
       end

julia> Zygote.gradient(f, x)
([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],)

julia> b1_2 = @benchmark $(Zygote.gradient)($f, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   8.891 μs …   1.357 ms  ┊ GC (min … max):  0.00% … 96.56%
 Time  (median):      9.832 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   12.484 μs ± 46.625 μs  ┊ GC (mean ± σ):  15.40% ±  4.09%

   ▆██▆▅▃▃▃▃▂▁ ▁▁▁                                            ▂
  ▇████████████████▇▇█▇▆▄▃▄▃▃▄▄▄▅▃▄▂▄▃▂▄▄▃▄▄▄▂▄▄▃▄▅▄▅▄▆▇▇▆▇▆▆ █
  8.89 μs      Histogram: log(frequency) by time      25.3 μs <

 Memory estimate: 86.84 KiB, allocs estimate: 31.

julia> b2_2 = @benchmark $(Zygote.gradient)($f, $y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.066 μs …   1.304 ms  ┊ GC (min … max):  0.00% … 96.15%
 Time  (median):      9.892 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   11.970 μs ± 43.215 μs  ┊ GC (mean ± σ):  14.43% ±  3.96%

     ▁▆██▆▄▁                                                   
  ▂▃▅████████▆▅▄▄▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂ ▃
  9.07 μs         Histogram: frequency by time        15.7 μs <

 Memory estimate: 86.84 KiB, allocs estimate: 31.

julia> judge(mean(b1_2), mean(b1_1))
BenchmarkTools.TrialJudgement: 
  time:   -4.41% => invariant (5.00% tolerance)
  memory: +21.94% => regression (1.00% tolerance)


julia> judge(mean(b2_2), mean(b2_1))
BenchmarkTools.TrialJudgement: 
  time:   -2.54% => invariant (5.00% tolerance)
  memory: +21.94% => regression (1.00% tolerance)

@sethaxen sethaxen requested a review from mcabbott March 12, 2022 12:18
@github-actions github-actions bot added the needs version bump Version needs to be incremented or set to -DEV in Project.toml label Mar 12, 2022
function frule((_, Δx), ::typeof(sqrt), x::Number)
Ω = sqrt(x)
∂Ω = Δx / 2Ω
return Ω, ifelse(iszero(Δx) & iszero(x), zero(∂Ω), ∂Ω)
Member
Is there a specific reason to use ifelse instead of a ternary operator (which does not require evaluating both branches)?

sethaxen (Member Author)
My reasoning was that, so long as the type of ∂Ω is inferrable, the two branches do no extra work. Using & and ifelse could also perform better in an inner loop, and could potentially allow Zygote to perform better for higher-order AD (Zygote tends to be slow when hitting control flow but has a special rule for ifelse). However, I was unable to devise a benchmark that showed a substantial difference.
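The semantic difference between the two forms can be demonstrated in plain Julia (a minimal sketch; `touch` is an illustrative helper): `ifelse` is an ordinary function, so both of its value arguments are evaluated before the call, while the ternary only evaluates the taken branch.

```julia
# Count how many times each form evaluates its arms.
calls = Ref(0)
touch(x) = (calls[] += 1; x)

r1 = true ? touch(1) : touch(2)        # ternary: only the taken arm runs
n_ternary = calls[]

calls[] = 0
r2 = ifelse(true, touch(1), touch(2))  # ifelse: both arms are evaluated
n_ifelse = calls[]

(n_ternary, n_ifelse)  # (1, 2)
```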

Member

I know that in older Julia versions there were cases where we could improve performance in SciML by avoiding zero or moving it out of loops. But I couldn't immediately reproduce this with a simple example; maybe it's not relevant here and/or has been fixed in recent Julia versions.

devmotion (Member) · Mar 13, 2022

An example where it matters:

julia> using BenchmarkTools

julia> function f(x)
           s = zero(x)
           for i in 1:10
               s += iseven(i) ? zero(x) : x
           end
           return s
       end
f (generic function with 1 method)

julia> function g(x)
           s = zero(x)
           for i in 1:10
               s += ifelse(iseven(i), zero(x), x)
           end
           return s
       end
g (generic function with 1 method)

julia> @btime f($(big"1.0"));
  45.640 μs (3002 allocations: 164.17 KiB)

julia> @btime g($(big"1.0"));
  56.341 μs (4002 allocations: 218.86 KiB)

mcabbott (Member)
This seems fine. How many other functions will need this? cbrt is currently a scalar rule. Powers are their own messy thing.

sethaxen (Member Author)

> This seems fine. How many other functions will need this? cbrt is currently a scalar rule. Powers are their own messy thing.

At first glance, /, \, ^, inv, sqrt, cbrt, log, log2, log10, log1p, and a bunch of inverse trig/hyperbolic functions.

I'm reluctant to add custom frules/rrules for all of these without first checking whether making zero (co)tangents strong zeros in the @scalar_rule macro causes a significant performance decrease, so perhaps before merging this I should open a PR on ChainRulesCore with a benchmark.
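A generic version of the guard could be sketched as follows (a hypothetical helper, not ChainRulesCore's actual macro internals; `strong_zero` is an illustrative name): wrap each computed partial so that a zero (co)tangent at a zero primal input yields a zero of the partial's type rather than a possibly-NaN product.

```julia
# If both the incoming (co)tangent Δ and the primal input x are zero,
# return a zero of the partial's type instead of the computed partial,
# which may be NaN from a 0/0 or 0*Inf intermediate.
strong_zero(∂, Δ, x) = ifelse(iszero(Δ) & iszero(x), zero(∂), ∂)

# Applied to sqrt's partial Δ / 2sqrt(x):
Δ, x = 0.0, 0.0
strong_zero(Δ / 2sqrt(x), Δ, x)  # 0.0 instead of NaN
```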

sethaxen (Member Author)

I opened a PR to ChainRulesCore that would supersede this one if merged: JuliaDiff/ChainRulesCore.jl#551

mcabbott (Member)

Functions like inv, log etc. are a slightly different class to sqrt, since the primal is infinite.

The motivating case for sqrt is I think something like f(x) = sqrt(x^2 + 0), which is regular at zero, and can be made to have a continuous derivative there. Is there something like that for inv, less trivial than g(x) = inv(inv(x))?
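The composition can be walked through by hand, without any AD library: the outer derivative of sqrt is infinite at zero, but the inner tangent of x^2 is exactly zero there, and their product Inf * 0 is NaN under the current rule.

```julia
# Chain-rule walk-through of f(x) = sqrt(x^2) at x = 0.
x = 0.0
u = x^2
inner = 2x                 # tangent of x^2 at 0: exactly zero
outer = 1 / (2 * sqrt(u))  # derivative of sqrt at 0: Inf
outer * inner              # NaN; a strong zero would give 0.0
```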

sethaxen (Member Author)

> Functions like inv, log etc. are a slightly different class to sqrt, since the primal is infinite.

Is this difference important, though? There are plenty of cases where an intermediate in a well-behaved primal function can be non-finite, resulting in the introduction of NaNs. Here's another one that hits users of lower-truncated normal distributions in Turing:

julia> using StatsFuns, FiniteDifferences, Zygote

julia> normcdf(0.0, 1.0, Inf)  # a constant function for all finite values of mu and sigma
1.0

julia> FiniteDifferences.grad(central_fdm(5, 1), x -> normcdf(0.0, x, Inf), 1.0)
(6.085449991639748e-14,)

julia> Zygote.gradient(x -> normcdf(0.0, x, Inf), 1.0)
(NaN,)

This happens because the gradient of erfc at Inf is 0, but when that zero gets pulled back through (x - mu)/sigma with x = Inf, the partial for / is infinite, so a NaN is introduced. This case is also resolved by treating zero (co)tangents as strong zeros.
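The propagation step can be sketched in plain Julia (illustrative variable names, not Zygote's internals): the pullback of z = (x - mu)/sigma with respect to sigma multiplies the incoming cotangent by an infinite intermediate, and 0 * Inf is NaN.

```julia
# Pullback of z = (x - mu) / sigma w.r.t. sigma: ∂sigma = -Δz * (x - mu) / sigma^2.
# With x = Inf the intermediate (x - mu) is infinite, so even a zero
# cotangent Δz yields 0 * Inf = NaN.
Δz, x, mu, sigma = 0.0, Inf, 0.0, 1.0
∂sigma = -Δz * (x - mu) / sigma^2  # NaN
```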

> The motivating case for sqrt is I think something like f(x) = sqrt(x^2 + 0), which is regular at zero, and can be made to have a continuous derivative there. Is there something like that for inv, less trivial than g(x) = inv(inv(x))?

Perhaps, but I don't see inv(inv(x)) (or log(exp(x))) as being any more trivial than sqrt(x^2).

Successfully merging this pull request may close these issues:

- Current rules for sqrt produce NaN for zero primal and (co)tangents
- NaN gradients for sqrt
3 participants