
Compressed LBFGS (forward) operator #258

Open
paraynaud wants to merge 16 commits into main from pr-compressed-LBFGS
Conversation

paraynaud (Member)

@dpo, this is the first version of an implementation of a compressed LBFGS (forward) operator.
I made a first structure, as well as a Matrix interface, and a mul! method.
There is a lot of room for improvement, but right now it is functional.
I haven't added tests yet.
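
For reference, here is a minimal usage sketch of that interface (assuming this branch of LinearOperators is loaded; the keyword names follow the benchmark further down, and later revisions use mem instead of m):

using LinearAlgebra, LinearOperators  # mul! comes from LinearAlgebra

n = 1_000
op = CompressedLBFGSOperator(n; m = 5)   # compressed forward L-BFGS approximation of dimension n

s, y = rand(n), rand(n)
push!(op, s, y)                          # feed (s, y) pairs as with any quasi-Newton operator

v  = ones(n)
Bv = similar(v)
mul!(Bv, op, v)                          # in-place product Bv = B * v

B = Matrix(op)                           # dense representation (the Matrix interface), mainly for testing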

github-actions bot (Contributor) commented Jan 13, 2023

Package name (latest / stable):
CaNNOLeS.jl, DCISolver.jl, FletcherPenaltySolver.jl, JSOSolvers.jl, Krylov.jl, NLPModels.jl, NLPModelsModifiers.jl, PROPACK.jl, Percival.jl, QuadraticModels.jl, SolverTools.jl

paraynaud force-pushed the pr-compressed-LBFGS branch 2 times, most recently from 1e51035 to 6c08717 on January 18, 2023 at 14:38
paraynaud (Member, Author) commented Jan 18, 2023

@dpo, @tmigot, (@amontoison, @geoffroyleconte) I think this PR is pretty mature.
Right now, a CompressedLBFGSOperator is faster to update and to apply with mul! than the usual LBFGSOperator.
Here are some @benchmark results from my personal laptop, comparing LBFGSOperator and CompressedLBFGSOperator:

using BenchmarkTools, LinearAlgebra, LinearOperators  # mul! comes from LinearAlgebra

n = 100000
m = 15
mem = m 
lbfgs = CompressedLBFGSOperator(n; m)
classic_lbfgs = LBFGSOperator(n; mem)

for i in 1:m #m
  s = rand(n)
  y = rand(n)
  push!(lbfgs, s, y)
  push!(classic_lbfgs, s, y)
end 

s = rand(n)
y = rand(n)
  
@benchmark push!(lbfgs, s, y) 
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 5
#  Range (min … max):  145.300 μs …   6.865 ms  ┊ GC (min … max): 0.00% … 96.10%
#  Time  (median):     159.900 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   186.701 μs ± 206.696 μs  ┊ GC (mean ± σ):  3.75% ±  3.42%
#
#   ▇█▇▃▂▂▂▂▆▆▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
#   145 μs           Histogram: frequency by time          440 μs <
#  Memory estimate: 86.28 KiB, allocs estimate: 16.

# BenchmarkTools.Trial: 6184 samples with 1 evaluation. # n=10000,  m=15
#  Range (min … max):  560.800 μs …   9.411 ms  ┊ GC (min … max): 0.00% … 87.77%
#  Time  (median):     733.700 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   796.455 μs ± 299.321 μs  ┊ GC (mean ± σ):  0.95% ±  3.07%
#        ▅▆█▇▇▇▄▃▁ 
#   ▂▂▄▆██████████▇▇▅▅▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
#   561 μs           Histogram: frequency by time         1.55 ms <
#  Memory estimate: 112.86 KiB, allocs estimate: 16.

# BenchmarkTools.Trial: 741 samples with 1 evaluation. # n=100000, m=15
#  Range (min … max):  5.402 ms … 18.267 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     6.557 ms              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   6.731 ms ±  1.057 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
#          ▂▆█▂▃▄▂ ▁▂▁ ▁▃
#   ▃▂▃▄▅▃████████████████▇▇▆▆▆▅▄▆▅▅▄▃▄▃▃▃▃▃▂▂▂▂▃▂▁▁▃▁▁▁▂▁▁▁▁▂ ▄
#   5.4 ms         Histogram: frequency by time        9.66 ms <
#  Memory estimate: 112.86 KiB, allocs estimate: 16.


@benchmark push!(classic_lbfgs, s, y)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 5
#  Range (min … max):  131.300 μs … 783.500 μs  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     135.500 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   153.789 μs ±  51.113 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#   █▆▄▄▂▁▁▁▁▁▂▂▁     ▁▁▁                                         ▁
#   ███████████████▇▇▇██████▇▇▆▇▇▆▇▆▆▆▆▆▆▆▇▇▆▆▆▆▅▅▅▅▅▅▅▅▆▄▅▅▅▃▅▅▄ █
#   131 μs        Histogram: log(frequency) by time        380 μs <
#  Memory estimate: 0 bytes, allocs estimate: 0.

# BenchmarkTools.Trial: 5073 samples with 1 evaluation. # n=10000, m = 15
#  Range (min … max):  858.900 μs …   2.479 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     895.200 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   981.256 μs ± 176.964 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#    ▆█▄▂▁▁▃▄▂▁▁▁▁▂▁▁  ▁▁▁                                        ▁
#   ███████████████████████████▆▇▇▇▇▇▇▇▆▇▆▆▆▇▆▇▆▆▆▅▆▆▆▆▆▅▄▃▅▆▅▅▅▅ █
#   859 μs        Histogram: log(frequency) by time       1.66 ms <
#  Memory estimate: 0 bytes, allocs estimate: 0.

# BenchmarkTools.Trial: 246 samples with 1 evaluation. # n=100000, m=15
#  Range (min … max):  19.177 ms … 27.130 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     19.926 ms              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   20.311 ms ±  1.151 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
#    ▁ ▁▁█▃▂▄    
#   ▅█▆████████▆▆▆▄▄▄▄▃▃▄▂▂▃▁▄▁▃▁▄▁▃▂▂▁▃▁▁▂▁▃▂▄▂▁▁▁▂▂▁▂▃▁▁▁▁▁▁▂ ▃
#   19.2 ms         Histogram: frequency by time        24.6 ms <
#  Memory estimate: 0 bytes, allocs estimate: 0.

Bv = similar(y)
v = ones(n)

@benchmark mul!(Bv, lbfgs, v)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000,  m=5
#  Range (min … max):  63.000 μs …  1.067 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     87.100 μs              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   96.820 μs ± 41.151 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#     ▁▄▇██▇▆▅▅▄▃▃▂▁▁▁                                          ▂
#   ▂▅█████████████████████▇▇▇▇▆▇▇▅▅▆▅▆▄▆▅▆▆▅▆▅▆▆▅▆▆▅▆▅▆▅▆▅▅▆▅▆ █
#   63 μs        Histogram: log(frequency) by time       273 μs <
#  Memory estimate: 0 bytes, allocs estimate: 0.

# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m=15
#  Range (min … max):  116.300 μs … 801.700 μs  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     128.700 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   134.797 μs ±  29.547 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#    ▄▆██▆▄▃▃▂▂▂▁▁                                                ▂
#   ████████████████▇▇▇▆▇▆▆▆▄▄▆▅▄▆▅▃▅▄▅▅▅▂▄▂▅▄▄▅▅▅▄▄▂▄▄▄▅▃▄▄▄▄▆▃▅ █
#   116 μs        Histogram: log(frequency) by time        295 μs <
#  Memory estimate: 0 bytes, allocs estimate: 0.


#  BenchmarkTools.Trial: 3477 samples with 1 evaluation. # n=100000, m=15
#  Range (min … max):  1.101 ms …   3.437 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     1.235 ms               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   1.412 ms ± 353.924 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#    ▃▇█▂        
#   ▄████▆▄▃▃▃▃▂▃▃▂▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
#   1.1 ms          Histogram: frequency by time         2.6 ms <
#  Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark mul!(Bv, classic_lbfgs, v)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000,  m = 5
#  Range (min … max):  109.700 μs … 742.600 μs  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     111.600 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   131.146 μs ±  49.643 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#   █▄▄▃▂ ▁  ▁▁▂▁▁       ▁                                        ▁
#   ████████████████▇▇▇████▇█▇▆▇▆▇▇▇▆▇▇▅▆▆▆▆▆▆▆▆▇▆▅▅▆▅▆▅▅▅▅▅▄▅▅▄▄ █
#   110 μs        Histogram: log(frequency) by time        339 μs <
#  Memory estimate: 0 bytes, allocs estimate: 0.

# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000,  m = 15
#  Range (min … max):  311.800 μs …  25.272 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     327.500 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   394.421 μs ± 279.993 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#   █▄▃▃▃▃▂▃▄▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁                                      ▁
#   ████████████████████████████▇██▇██▇▇▆▆▇▆▇▆▇▆▆▆▆▆▆▆▆▆▅▆▅▅▆▄▄▄▅ █
#   312 μs        Histogram: log(frequency) by time        905 μs <
#  Memory estimate: 0 bytes, allocs estimate: 0.

# BenchmarkTools.Trial: 951 samples with 1 evaluation. # n=100000, m=15
#  Range (min … max):  4.579 ms …   9.967 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     5.061 ms               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   5.241 ms ± 560.678 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#      ▃█▅▄▂▃▁        
#   ▂▃▇███████▇▆▆▄▆▅▆▆▅▄▄▅▅▅▄▄▄▅▄▄▄▄▃▂▃▃▂▃▂▂▃▂▂▁▁▂▂▂▁▂▁▂▁▁▂▁▁▂▁ ▃
#   4.58 ms         Histogram: frequency by time        6.88 ms <
#  Memory estimate: 0 bytes, allocs estimate: 0.

Some allocations remain in push!(lbfgs, s, y), due to several mandatory matrix inversions (considering n == mem).

Next, I plan to include some tests for the CUDA architecture.
I tried to design my code to work for both CPU and GPU.

geoffroyleconte (Member)

Hi, thanks for this PR.
Usually in this package the functions to perform matrix-vector products are implemented in prod!, for example:

prod! = @closure (res, x, α, β) -> lbfgs_multiply(res, lbfgs_data, x, α, β)

Is it possible to make a similar change (your mul! function should go to prod!)?

And is it possible to remove CUDA from the deps? I can only see 3 lines where you use CUDA functions; maybe you could add a way for the user to specify directly the type they need if they want to use CUDA? See for example:

LinearOperator(M::Symmetric{T}, S = Vector{T}) where {T <: Real} =

amontoison (Member)

It's more generic if you add an argument with the storage type S.
It will also work with other GPU backends whereas your current implementation is restricted to Nvidia GPUs.


# step 7
mul!(Bv, view(op.Yₖ, :, 1:op.k), view(op.sol, 1:op.k))
mul!(Bv, view(op.Sₖ, :, 1:op.k), view(op.sol, op.k+1:2*op.k), - op.α, (T)(-1))
Inline review comment (Member). Suggested change:
- mul!(Bv, view(op.Sₖ, :, 1:op.k), view(op.sol, op.k+1:2*op.k), - op.α, (T)(-1))
+ mul!(Bv, view(op.Sₖ, :, 1:op.k), view(op.sol, op.k+1:2*op.k), - op.α, -one(T))

paraynaud (Member, Author)

> Hi, thanks for this PR. Usually in this package the functions to perform matrix-vector products are implemented in prod!, for example:
>
> prod! = @closure (res, x, α, β) -> lbfgs_multiply(res, lbfgs_data, x, α, β)
>
> Is it possible to make a similar change (your mul! function should go to prod!)?

Adding prod! is not an issue.
Since prod! is a 5-argument method, I will add it as a new field of the structure, similarly to LBFGSOperator.
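
A rough sketch of that wiring, modeled on the @closure line quoted above (compressed_lbfgs_multiply! and data are illustrative stand-ins for the PR's mul! kernel and storage, not its actual code):

using LinearOperators, FastClosures   # @closure is already used by LBFGSOperator

# Illustrative only: `data` is the compressed L-BFGS storage and
# compressed_lbfgs_multiply!(tmp, data, v) is the current mul! kernel (tmp ← B * v).
function compressed_lbfgs_linear_operator(data, n::Int, T::DataType)
  tmp = Vector{T}(undef, n)   # preallocated buffer so prod! stays allocation-free
  prod! = @closure (res, v, α, β) -> begin
    compressed_lbfgs_multiply!(tmp, data, v)
    if β == zero(T)
      res .= α .* tmp
    else
      res .= α .* tmp .+ β .* res   # 5-argument convention: res ← α B v + β res
    end
  end
  return LinearOperator(T, n, n, true, true, prod!)
end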

> And is it possible to remove CUDA from the deps? I can only see 3 lines where you use CUDA functions; maybe you could add a way for the user to specify directly the type they need if they want to use CUDA? See for example:
>
> It's more generic if you add an argument with the storage type S.
> It will also work with other GPU backends whereas your current implementation is restricted to Nvidia GPUs.

The user can specify directly the types they need, for both the <:AbstractVector and the <:AbstractMatrix, but I make their job easier by determining which one to use depending on the architecture.
Removing CUDA will make the operator more complicated for any new user.
Another advantage is that you don't need different test files for buildkite.

amontoison (Member) commented Jan 19, 2023

> > And is it possible to remove CUDA from the deps? I can only see 3 lines where you use CUDA functions; maybe you could add a way for the user to specify directly the type they need if they want to use CUDA? See for example:
> >
> > It's more generic if you add an argument with the storage type S.
> > It will also work with other GPU backends whereas your current implementation is restricted to Nvidia GPUs.
>
> The user can specify directly the types they need, for both the <:AbstractVector and the <:AbstractMatrix, but I make their job easier by determining which one to use depending on the architecture. Removing CUDA will make the operator more complicated for any new user. Another advantage is that you don't need different test files for buildkite.

You suppose that GPU == Nvidia, but if tomorrow we want to use the same operator on Intel GPUs, will you add oneAPI.jl as a dependency?
I don't understand your argument about test files for buildkite. You can use buildkite to run all the tests of LinearOperators if you want.

paraynaud (Member, Author)

> > And is it possible to remove CUDA from the deps? I can only see 3 lines where you use CUDA functions; maybe you could add a way for the user to specify directly the type they need if they want to use CUDA? See for example:
> > It's more generic if you add an argument with the storage type S.
> > It will also work with other GPU backends whereas your current implementation is restricted to Nvidia GPUs.
> >
> > The user can specify directly the types they need, for both the <:AbstractVector and the <:AbstractMatrix, but I make their job easier by determining which one to use depending on the architecture. Removing CUDA will make the operator more complicated for any new user. Another advantage is that you don't need different test files for buildkite.

> You suppose that GPU == Nvidia

For now, yes.
But later, if necessary, I would improve the default_gpu method as well as the structures defined accordingly.

> but if tomorrow we want to use the same operator on Intel GPUs, will you add oneAPI.jl as a dependency?

I guess it will need oneAPI as a dependency.

> I don't understand your argument about test files for buildkite. You can use buildkite to run all the tests of LinearOperators if you want.

In the end, it multiplies the GPU packages taken as dependencies, but the exact same code will work on every architecture.

Personally, I think it is better if the user doesn't need to spell out the types of the data structures, but the drawback is having GPU packages as dependencies :/

geoffroyleconte (Member) commented Jan 19, 2023

I agree with @amontoison here. Maybe you could use Requires.jl (see https://github.com/JuliaSmoothOptimizers/RipQP.jl/blob/d69c91618b95731f32cd2701f945bad1c254e3f6/src/RipQP.jl#L19) if there is no other choice, but I feel like it could be avoided if you make some minor changes and document your operator well.

If it is not possible, I suggest you make this PR work on the CPU only, and once it is merged we can discuss the GPU compatibility in another PR.

paraynaud (Member, Author)

> if you make some minor changes and document your operator well

Sure, but I think it is the user's loss.
I am thinking about Requires, but I am not sure how to make it work.

> If it is not possible, I suggest you make this PR work on the CPU only, and once it is merged we can discuss the GPU compatibility in another PR.

I don't mind, even if I don't like it.
But I would prefer a clear answer on this one, rather than implementing this scheme in my other modules for nothing ;)

geoffroyleconte (Member)

Then I would advise not adding CUDA to the dependencies.

We used the following multiple times in this package:

LinearOperator(M::Symmetric{T}, S = Vector{T}) where {T <: Real} =

Using what you propose would mean changing all these functions too if we want to have a coherent API for the whole package.
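
For concreteness, here is a sketch of what the allocation could look like for your operator under that convention (the helper name is hypothetical and the real structure has more fields); every buffer is derived from S:

using LinearAlgebra

# Hypothetical helper: everything is derived from the storage type S, so passing
# S = CuVector{Float64} (with `using CUDA`) moves the whole storage to the GPU.
function allocate_compressed_lbfgs_storage(n::Int; mem::Int = 5, S::Type = Vector{Float64})
  T   = eltype(S)
  M   = typeof(similar(S(undef, 0), T, 0, 0))   # matrix type in the same storage family as S
  Sₖ  = M(undef, n, mem)                        # stored s vectors
  Yₖ  = M(undef, n, mem)                        # stored y vectors
  Dₖ  = Diagonal(S(undef, mem))                 # diagonal of Sₖᵀ Yₖ
  sol = S(undef, 2 * mem)                       # work vector used in the product
  return (; Sₖ, Yₖ, Dₖ, sol)
end

# allocate_compressed_lbfgs_storage(10_000)                        # CPU
# allocate_compressed_lbfgs_storage(10_000; S = CuVector{Float64}) # GPU, with `using CUDA`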

paraynaud (Member, Author)

> Then I would advise not adding CUDA to the dependencies.
>
> We used the following multiple times in this package:
>
> LinearOperator(M::Symmetric{T}, S = Vector{T}) where {T <: Real} =

It would be better: the user would not have to change S manually when the code is launched on a GPU.

My motivation for implementing CompressedLBFGSOperator is that the current implementation of LBFGSOperator (forward) does not translate to GPU. So not everything in the module would need to change.

amontoison (Member) commented Jan 20, 2023

> > Then I would advise not adding CUDA to the dependencies.
> > We used the following multiple times in this package:
> >
> > LinearOperator(M::Symmetric{T}, S = Vector{T}) where {T <: Real} =
>
> It would be better: the user would not have to change S manually when the code is launched on a GPU.
>
> My motivation for implementing CompressedLBFGSOperator is that the current implementation of LBFGSOperator (forward) does not translate to GPU. So not everything in the module would need to change.

Honestly, I don't see a big difference between T=Float64, gpu=true and S=CuVector{Float64}.
I'm not sure that default_gpu() should return true whenever a GPU is available. If you use it in optimization methods that only work on the CPU, you will get errors or slow conversions.

paraynaud (Member, Author)

> Honestly, I don't see a big difference between T=Float64, gpu=true and S=CuVector{Float64}.

My code needs neither gpu=true nor S=CuVector{Float64}; T=Float64 is the default.
If you want to switch to T=Float32, M and S will change accordingly.

> I'm not sure that default_gpu() should return true whenever a GPU is available.

If this corner case happens, the user will give their desired M and S types.

> If you use it in optimization methods that only work on the CPU, you will get errors or slow conversions.

I am not sure why I would get errors or slow conversions, or why that would not be the case if I implemented it the other way.

Wanting structures unrelated to the architecture is, for me, a small minority of use cases.

paraynaud force-pushed the pr-compressed-LBFGS branch 2 times, most recently from 5f48fc8 to 4a293cc on January 31, 2023 at 10:17
paraynaud (Member, Author)

@geoffroyleconte I added Requires and removed CUDA from the dependencies, and the tests pass (for the CPU).
To avoid messing up the __init__ of LinearOperators, I made a submodule ModCompressedLBFGSOperator and I use Requires in it.

@amontoison unfortunately, I got (at least) one issue with the GPU :/

amontoison (Member)

Paul, could you test with the master branch of CUDA.jl?
You just need to make a modification similar to
https://github.com/JuliaSmoothOptimizers/Krylov.jl/blob/main/.buildkite/pipeline.yml#L33

paraynaud (Member, Author)

It seems to work on GPU now, but it struggles to invert 3 intermediate matrices :/
These inversions are likely to cap the performance of the method.

dpo (Member) commented Jan 31, 2023

Let's not add submodules, please.

geoffroyleconte (Member)

I still think that removing CUDA entirely is a better solution. As mentioned by @amontoison, you would also be able to use other GPU backends. You could add a constructor for your CompressedLBFGSOperator in your other modules if they use CUDA.

paraynaud (Member, Author)

> I still think that removing CUDA entirely is a better solution. As mentioned by @amontoison, you would also be able to use other GPU backends. You could add a constructor for your CompressedLBFGSOperator in your other modules if they use CUDA.

It may use other GPU backends, and it can be done automatically by adding 2 lines in the Requires part (for each GPU backend).
In the worst case, the user defines the structures they want with the optional parameters.
Having the same code working on any architecture is a non-negligible asset for me, especially for tutorials and demonstrations!

codecov bot commented Jan 31, 2023

Codecov Report

Base: 97.32% // Head: 96.18% // Decreases project coverage by -1.14% ⚠️

Coverage data is based on head (11e3b34) compared to base (9591943).
Patch coverage: 82.79% of modified lines in pull request are covered.

❗ Current head 11e3b34 differs from pull request most recent head 341332f. Consider uploading reports for the commit 341332f to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #258      +/-   ##
==========================================
- Coverage   97.32%   96.18%   -1.14%     
==========================================
  Files          14       15       +1     
  Lines        1009     1102      +93     
==========================================
+ Hits          982     1060      +78     
- Misses         27       42      +15     
Impacted Files Coverage Δ
src/compressed_lbfgs.jl 82.79% <82.79%> (ø)
src/utilities.jl 96.47% <0.00%> (+1.17%) ⬆️


amontoison (Member) commented Jan 31, 2023

> > I still think that removing CUDA entirely is a better solution. As mentioned by @amontoison, you would also be able to use other GPU backends. You could add a constructor for your CompressedLBFGSOperator in your other modules if they use CUDA.
>
> It may use other GPU backends, and it can be done automatically by adding 2 lines in the Requires part (for each GPU backend). In the worst case, the user defines the structures they want with the optional parameters. Having the same code working on any architecture is a non-negligible asset for me, especially for tutorials and demonstrations!

Paul, the code will also work on any architecture if you use what Geoffroy and I suggest.

paraynaud (Member, Author)

> > I still think that removing CUDA entirely is a better solution. As mentioned by @amontoison, you would also be able to use other GPU backends. You could add a constructor for your CompressedLBFGSOperator in your other modules if they use CUDA.
> >
> > It may use other GPU backends, and it can be done automatically by adding 2 lines in the Requires part (for each GPU backend). In the worst case, the user defines the structures they want with the optional parameters. Having the same code working on any architecture is a non-negligible asset for me, especially for tutorials and demonstrations!
>
> Paul, the code will also work on any architecture if you use what Geoffroy and I suggest.

I know.
I don't understand the issue: is it because I added Requires as a dependency?
If the purpose of the tools we provide is to reach as many users as possible, what I propose is really useful, with no drawback now that CUDA is no longer a dependency.
Between:

CompressedLinearOperator(n)

runnable on any architecture and

CompressedLinearOperator(n; T=..., V=CUDA.CuVector{T, CUDA.Mem.DeviceBuffer}, M=CUDA.CuMatrix{T, CUDA.Mem.DeviceBuffer})

which has to be changed for each architecture, the first option is clearly better (for me).

amontoison (Member) commented Feb 1, 2023

The issue is that I didn't understand what you wanted to achieve.
Your previous message helped me understand.

Your example has a big drawback: the behavior will differ depending on whether you have a GPU or not, and the user has no control over it. In practice we don't always want to use an operator or routine specialized for GPUs.
Julia packages with GPU support are not widespread, and from my point of view they never will be, given the constraints of GPU programming.

About your previous message: the type of the GPU buffer is not required in the constructor.
T could default to a value derived from the S or M argument, because you can recover it with eltype(S) or eltype(M).
You can also provide only S or M with this function: https://github.com/JuliaSmoothOptimizers/Krylov.jl/blob/main/src/krylov_utils.jl#L241

So, the question is: do we prefer

CompressedLinearOperator(n)
# Which GPU backend will be used in the future if you have an Intel CPU and an Nvidia GPU, because oneAPI.jl and CUDA.jl are both installed?
# Do we need to provide all options if we have a Nvidia GPU but we don't want to use it?

CompressedLinearOperator(n, S=CuVector{T})

paraynaud (Member, Author) commented Feb 1, 2023

> About your previous message: the type of the GPU buffer is not required in the constructor.

I tried without it, but it failed to correctly type

Dₖ = Diagonal(V(undef, mem))

and adding CUDA.Mem.DeviceBuffer made it work.

> Your example has a big drawback: the behavior will differ depending on whether you have a GPU or not, and the user has no control over it. In practice we don't always want to use an operator or routine specialized for GPUs.

There is no drawback, because the user still has control:

CompressedLBFGSOperator(n::Int; mem::Int=5, T=Float64, M=default_matrix_type(; T), V=default_vector_type(; T))

The trick is that default_matrix_type and default_vector_type return Matrix and Vector by default, but the Requires part overwrites those two methods depending on the modules loaded.
If you don't want to rely on the default M and V, you just have to pass M and V explicitly, replacing default_matrix_type and default_vector_type, as Geoffroy suggested; but I believe that most of the time you should not have to do so.
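
Roughly, the mechanism looks like this (a sketch; the exact method names and the Requires block in the PR may differ):

using Requires, LinearAlgebra

# Defaults used when no GPU package is loaded.
default_vector_type(; T::DataType = Float64) = Vector{T}
default_matrix_type(; T::DataType = Float64) = Matrix{T}
default_gpu() = false

# Inside the package (sub)module's __init__, Requires overwrites the defaults once CUDA is loaded.
function __init__()
  @require CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba" begin   # CUDA.jl's UUID
    default_vector_type(; T::DataType = Float64) = CUDA.CuVector{T, CUDA.Mem.DeviceBuffer}
    default_matrix_type(; T::DataType = Float64) = CUDA.CuMatrix{T, CUDA.Mem.DeviceBuffer}
    default_gpu() = CUDA.functional()
  end
end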

> T could default to a value derived from the S or M argument, because you can recover it with eltype(S) or eltype(M).

I followed the LBFGSOperator API, which also has T as a parameter.

If several GPU backend modules are loaded, fair enough: the user will have to specify M and V manually.

geoffroyleconte (Member)

The LBFGSOperator in this package uses scalar indexing, so it is not made for GPUs anyway.
As I mentioned in a previous message, the API in this package currently uses this:

function LinearOperator(M::SymTridiagonal{T}, S = Vector{T}) where {T}

(you can use S = CuArray{Float64, 1, CUDA.Mem.DeviceBuffer} for example).
Having multiple ways to do the same thing for different operators is confusing. I am not against changing the API for the whole package but I think that this should not be done in this PR.

Comment on lines +237 to +238
view(data.inverse_intermediate_1, 1:2*data.k, 1:2*data.k) .= inv(data.intermediate_1[1:2*data.k, 1:2*data.k])
view(data.inverse_intermediate_2, 1:2*data.k, 1:2*data.k) .= inv(data.intermediate_2[1:2*data.k, 1:2*data.k])
paraynaud (Member, Author):

inv is used as a placeholder until a better solution is found.
This code runs only when an update occurs.
The dimension of the inverted matrices is related to m, not to n.

Reply (Member):

You could use LAPACK.getrf! and LAPACK.getri! on data.intermediate_1[1:2*data.k, 1:2*data.k] and data.intermediate_2[1:2*data.k, 1:2*data.k].
getrf! computes a dense LU decomposition in-place.
getri! uses the factors of the LU decomposition to compute the inverse.

I added a dispatch in CUDA.jl and AMDGPU.jl for these LAPACK calls.
https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusolver/dense.jl#L895-L926
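
A sketch of that replacement on a dense working block (the view must be a contiguous, BLAS-compatible matrix; error handling simplified):

using LinearAlgebra

# In-place inverse of the leading 2k×2k block, avoiding the allocations of `inv`.
# A is a dense working copy (or contiguous view) of intermediate_1 or intermediate_2.
function inverse_in_place!(A::AbstractMatrix)
  A, ipiv, info = LAPACK.getrf!(A)       # LU factorization in-place, returns the pivots
  info == 0 || throw(SingularException(info))
  LAPACK.getri!(A, ipiv)                 # overwrite A with its inverse using the LU factors
  return A
end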
