Is there any plans for GPU support? #17

Closed · hakkelt opened this issue Jun 30, 2020 · 30 comments

Comments

hakkelt (Contributor) commented Jun 30, 2020

Just to put this question into context: I recently developed a similar package (https://github.com/hakkelt/FunctionOperators.jl), unfortunately before realizing that such a package already exists (which is all the more annoying given that I spent some time searching for such packages before starting to develop my own...). As AbstractOperators seems to be quite a good package and is definitely more mature and feature-rich than mine, I would be willing to contribute to this package to add GPU support (or, more specifically, support for CuArray input and output), rather than continue developing FunctionOperators.

I looked at the source code, and it appears to me that the main design barrier to GPU support is that the calculus rules allocate buffers when the expression is parsed, while it only turns out later, when the operator is applied to an input array, whether the input is a CPU or a GPU array. To avoid this problem in FunctionOperators, I allocate the buffers only when the operator is first applied to its inputs, and the buffer is then, of course, re-used on later calls. I think this approach would be viable here as well, and it would need only a reasonable amount of refactoring.
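
To make the idea concrete, a minimal, package-independent sketch of the deferred-allocation pattern could look like this (the LazyBuffer and getbuffer! names are made up for illustration; this is not FunctionOperators.jl code):

mutable struct LazyBuffer
    buf::Any   # `nothing` until the first use
end
LazyBuffer() = LazyBuffer(nothing)

# allocate the buffer from the first input it sees, so it automatically matches
# the input's container type (Array, CuArray, ...); later calls reuse it
function getbuffer!(b::LazyBuffer, x::AbstractArray, dims...)
    b.buf === nothing && (b.buf = similar(x, eltype(x), dims...))
    return b.buf
end

b = LazyBuffer()
y1 = getbuffer!(b, randn(5), 8)   # allocates a Vector{Float64} of length 8
y2 = getbuffer!(b, randn(5), 8)   # reuses the same array
y1 === y2                         # true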

nantonel (Member) commented Jul 2, 2020

Hi @hakkelt, firstly, to answer your question: is there any plan for GPU support? Well, let's say there is no official one 😄
It would be great to make that possible though.

My experience with GPU programming is rather limited, but as far as I know one of the bottlenecks of GPU programming is moving memory. So I believe efficient GPU processing can be reached only if all mul! methods are extended to allow CuArray both as input and output.

I'm not sure you can always avoid allocating/sharing buffers: in the approach we've taken here, these buffers need to be stored during the forward pass in order to compute a successful backward pass and obtain gradients. This is strictly necessary when nonlinear operators are involved. You can probably share buffers within linear operators as long as the dimensions are consistent.

Of course contributions are very welcome! 😉

hakkelt (Contributor, Author) commented Jul 7, 2020

Hi @nantonel,
sorry, I described my idea in a somewhat confusing way. I totally agree that it is not possible to avoid allocating buffers, but it is possible to defer the allocation until the first use of the operator, and then "cache" the allocated buffer for later use. For example, let's consider the following sections from Compose.jl:

struct Compose{N, M, L<:NTuple{N,Any}, T<:NTuple{M,Any}} <: AbstractOperator
    A::L
    buf::T
end

function Compose(L1::AbstractOperator, L2::AbstractOperator)
    if size(L1,2) != size(L2,1)
    	throw(DimensionMismatch("cannot compose operators"))
    end
    if domainType(L1) != codomainType(L2)
    	throw(DomainError())
    end
    Compose( L1, L2, Array{domainType(L1)}(undef,size(L2,1)) )
end

[...]

@generated function mul!(y::C, L::Compose{N,M,T1,T2},b::D) where {N,M,T1,T2,C,D}
    ex = :(mul!(L.buf[1],L.A[1],b))
    for i = 2:M
    	ex = quote
    	    $ex
    	    mul!(L.buf[$i],L.A[$i], L.buf[$i-1])
        end
    end
    ex = quote
    	$ex
    	mul!(y,L.A[N], L.buf[M])
    	return y
    end
end

We can rewrite this code to the following to defer the allocation:

mutable struct Compose{N, M, L<:NTuple{N,Any}, T<:NTuple{M,Any}} <: AbstractOperator
    A::L
    buf::Union{Nothing,T}  # I modified the code HERE
end

function Compose(L1::AbstractOperator, L2::AbstractOperator)
    if size(L1,2) != size(L2,1)
    	throw(DimensionMismatch("cannot compose operators"))
    end
    if domainType(L1) != codomainType(L2)
    	throw(DomainError())
    end
    Compose( L1, L2, nothing )   # I modified the code HERE
end

[...]

@generated function mul!(y::C, L::Compose{N,M,T1,T2},b::D) where {N,M,T1,T2,C,D}
    ex = quote
        L.buf[1] isa Nothing && ( L.buf[1] = Array{codomainType(L.A[1])}(undef, size(L.A[1],1)) )  # I modified the code HERE
        mul!(L.buf[1],L.A[1],b)
    end
    for i = 2:M
    	ex = quote
    	    $ex
            L.buf[$i] isa Nothing && ( L.buf[$i] = Array{codomainType(L.A[$i])}(undef, size(L.A[$i],1)) )  # I modified the code HERE
            mul!(L.buf[$i],L.A[$i], L.buf[$i-1])
        end
    end
    ex = quote
    	$ex
        mul!(y,L.A[N], L.buf[M])
        return y
    end
end

I didn't run this code, but I think this approach should work (after defining the backward mul! analogously, of course). The advantage of this method is that we can then overload the mul! function to support GPU arrays:

@generated function mul!(y::CuArray, L::Compose{N,M,T1,T2},b::CuArray) where {N,M,T1,T2}
    ex = quote
        L.buf[1] isa Nothing && ( L.buf[1] = CuArray{codomainType(L.A[1])}(undef, size(L.A[1],1)) )
        mul!(L.buf[1],L.A[1],b)
    end
    for i = 2:M
        ex = quote
    	    $ex
            L.buf[$i] isa Nothing && ( L.buf[$i] = CuArray{codomainType(L.A[$i])}(undef, size(L.A[$i],1)) )
            mul!(L.buf[$i],L.A[$i], L.buf[$i-1])
    	end
    end
    ex = quote
    	$ex
    	mul!(y,L.A[N], L.buf[M])
    	return y
    end
end

I'm not sure whether it satisfies the criterion you mentioned ("these buffers need to be stored in the forward pass in order to compute a successful backward pass and obtain gradients"), but if it does, and you don't see any obvious flaws that would cause trouble in other components of the package (or in the packages built on top of it), then I would create a PR to see whether it works in practice. ☺️

hakkelt (Contributor, Author) commented Jul 7, 2020

By the way, this would mean that once an operator has been applied to GPU arrays, it cannot be used on CPU arrays any more, and vice versa. But it is possible to tackle this problem by modifying the definition of the struct to hold two separate lists of buffers: one for CPU buffers and the other for GPU buffers.

nantonel (Member) commented Jul 8, 2020

If it's only about being able to change the array type of buf, this can be done quite easily, e.g. by modifying the constructor:

function Compose(L1::AbstractOperator, L2::AbstractOperator; T=Array)
    if size(L1,2) != size(L2,1)
    	throw(DimensionMismatch("cannot compose operators"))
    end
    if domainType(L1) != codomainType(L2)
    	throw(DomainError())
    end
    Compose( L1, L2, T{domainType(L1)}(undef,size(L2,1)) )
end

then you can set T=CuArray.

This also keeps Compose immutable and avoids any modification of mul!(y::C, L::Compose{N,M,T1,T2},b::D).

All that is needed is to define mul!(y::CuArray,A::L,x::CuArray) and mul!(y::CuArray,A::Adjoint{L},x::CuArray) for every L appearing in Compose.

Regarding using both CPU and GPU, perhaps another linear operator could be introduced, similar to Eye, which simply accepts a CuArray and returns an Array.
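
For illustration, such a transfer operator could look roughly like the sketch below (the name GPUToCPU and the way it hooks into the AbstractOperator interface are assumptions, not existing code):

using CUDA                      # assumed GPU backend, provides CuArray
using AbstractOperators
import AbstractOperators: domainType, codomainType
import LinearAlgebra: mul!
import Base: size

# Eye-like operator whose forward map copies a CuArray into an Array
struct GPUToCPU{T,N} <: AbstractOperator
    dim::NTuple{N,Int}
end
GPUToCPU(T::Type, dim::NTuple{N,Int}) where {N} = GPUToCPU{T,N}(dim)

size(L::GPUToCPU) = (L.dim, L.dim)
domainType(L::GPUToCPU{T}) where {T} = T
codomainType(L::GPUToCPU{T}) where {T} = T

mul!(y::Array, L::GPUToCPU, x::CuArray) = copyto!(y, x)   # forward: GPU -> CPU
# (an analogous method copying CPU -> GPU would be needed for the adjoint wrapper)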

hakkelt (Contributor, Author) commented Jul 8, 2020

Yes, that is also a solution, but I prefer the other way because it allows end users to switch from CPU to GPU without changing their code, and it also works for shorthand constructors. In contrast, it might confuse users that only the calculus rules require the input/output array type to be declared explicitly in their constructor (I think the linear and non-linear operators only need their mul! function to be overloaded, so it makes no sense to change their constructors).

I don't think the deferred allocation is much more difficult to implement. On the other hand, your concern about making the structs mutable might be a more important question, but why shouldn't we do that? In my opinion, the performance hit would be insignificant. And if the possibility of changing other fields is your main concern, then it is possible to introduce an auxiliary mutable wrapper type that holds either the buffer or nothing, so we can leave Compose and the other structs immutable (although it would complicate the code, so I would rather avoid that). For example:

mutable struct Buffer
    b::Union{Nothing,Any}
end

struct Compose{N, M, L<:NTuple{N,Any}, T<:NTuple{M,Buffer}} <: AbstractOperator
    A::L
    buf::T
end

Or are there any other issues with mutability?

hakkelt (Contributor, Author) commented Jul 8, 2020

And another argument for deferred allocation: it is much easier to decide in the mul! function whether we need a new allocation or whether we can re-use a previously allocated buffer. For example:

@generated function mul!(y::C, L::Compose{N,M,T1,T2},b::D) where {N,M,T1,T2,C,D}
    ex = quote
        L.buf[1] isa Nothing && ( L.buf[1] = Array{codomainType(L.A[1])}(undef, size(L.A[1],1)) )
        mul!(L.buf[1],L.A[1],b)
    end
    for i = 2:M
    	ex = quote
    	    $ex
            if L.buf[$i] isa Nothing
                L.buf[$i] = ( size(L.A[$i],1) == size(L.A[$i-1],1) ? L.buf[$i-1] : Array{codomainType(L.A[$i])}(undef, size(L.A[$i],1)) )
            end
            mul!(L.buf[$i],L.A[$i], L.buf[$i-1])
        end
    end
    ex = quote
    	$ex
        mul!(y,L.A[N], L.buf[M])
        return y
    end
end

Ok, this solution is not the best because some operators don't work in place (or at least they need some modification to perform the operation in place, like DFT), but I know a better (though slightly more complex) solution that worked quite well for FunctionOperators.jl, and it benefits from the fact that more information is available at the time of mul! than at construction. Anyway, my point with this example is that delaying the allocation opens the door for further optimization.

hakkelt (Contributor, Author) commented Jul 8, 2020

(A bit off-topic question by the way: What is the purpose of M in Compose{N,M,T1,T2}? Isn't it always M = N - 1?)

nantonel (Member) commented Jul 8, 2020

(A bit off-topic question by the way: What is the purpose of M in Compose{N,M,T1,T2}? Isn't it always M = N - 1?)

I actually don't remember! 😅 but this is off topic, feel free to open another issue.

nantonel (Member) commented Jul 8, 2020

Yes, that is also a solution, but I prefer the other way because it allows end users to switch from CPU to GPU without changing their code, and it also works for shorthand constructors.

Well, this package is not entirely intended for end users... the end-user package is StructuredOptimization.jl.

In contrast, it might confuse users that only the calculus rules require the input/output array type to be declared explicitly in their constructor (I think the linear and non-linear operators only need their mul! function to be overloaded, so it makes no sense to change their constructors).

Some operators need buffers of their own, so I don't see how this can be avoided...

I don't think the deferred allocation is much more difficult to implement. On the other hand, your concern about making the structs mutable might be a more important question, but why shouldn't we do that? In my opinion, the performance hit would be insignificant. And if the possibility of changing other fields is your main concern, then it is possible to introduce an auxiliary mutable wrapper type that holds either the buffer or nothing, so we can leave Compose and the other structs immutable (although it would complicate the code, so I would rather avoid that). For example:

mutable struct Buffer
    b::Union{Nothing,Any}
end

struct Compose{N, M, L<:NTuple{N,Any}, T<:NTuple{M,Buffer}} <: AbstractOperator
    A::L
    buf::T
end

Or are there any other issues with mutability?

not sure, but mutable objects are more "dangerous"...

Overall I don't understand why you want to switch between CPU and GPU... isn't it more efficient to stick to one of the two? As far as I know, software like pytorch does not provide this possibility (possibly because transferring data between GPU and CPU introduces substantial delays). But as I said already, I'm not an expert on GPU programming.

hakkelt (Contributor, Author) commented Jul 8, 2020

Well, this package is not entirely intended for end users... the end-user package is StructuredOptimization.jl.

Ok, that's fair enough, but then, unfortunately, this package might not meet my needs. I'm working on optimization methods too, but currently I'm mostly just prototyping and trying out new algorithms, so the other packages in the KUL forBES group are too high-level for me.

Overall I don't understand why you want to switch between CPU and GPU... isn't it more efficient to stick to one of the two? As far as I know, software like pytorch does not provide this possibility (possibly because transferring data between GPU and CPU introduces substantial delays). But as I said already, I'm not an expert on GPU programming.

Oh, I see now what you mean! No, I don't want to switch back and forth between CPU and GPU inside one algorithm. What I want is to implement the algorithm on the CPU first, and when it runs without errors, switch to GPU arrays, following the recommended workflow of CuArrays. Ideally, when you have a working CPU implementation, all you need to do is replace Arrays with CuArrays, and then CUDA.jl (or OpenCL.jl) does the rest. In practice, however, some small parts of the code need to be re-implemented for the GPU to achieve optimal performance, but that is still only a minor part of the entire code base. So actually, I think that a smooth transition from CPU to GPU would also be very useful for the other packages in the KUL forBES group, if you plan to add GPU support at some point in the future.

By the way, the interface of pytorch is intentionally designed to be almost identical to the interface of numpy, for the same reason (it is recommended to write your code using numpy first, and then you can easily switch to pytorch).

hakkelt (Contributor, Author) commented Jul 8, 2020

(A bit off-topic question by the way: What is the purpose of M in Compose{N,M,T1,T2}? Isn't it always M = N - 1?)

I actually don't remember! but this is off topic, feel free to open another issue.

Never mind, I've figured out why.

nantonel (Member) commented Jul 9, 2020

Ok, so I think we agree on the fact that all buffers in Compose should be either only on the CPU or only on the GPU.

Then I'm really sorry, but I don't see the advantages of the delayed buffer allocation you proposed.

If it's about sharing buffers it would be preferable to do this during construction. Something like:

function Compose(L::Array{AbstractOperators}; T=Array)
  # check properties of `L` and see if any buffer can be shared
  # allocate only the buffers that are needed (either in GPU or CPU using `T`)
  # Construct `Compose` using `Compose((L[1],L[2],...),(buf1,buf2,buf1,...))`
end

nantonel (Member) commented Jul 9, 2020

I mean, in my opinion, the minimal amount of code will be produced by having all calculus rules array-agnostic. This is definitely not the case at the moment, but it shouldn't be too hard to achieve, for example by adding that keyword argument (and perhaps by giving an extra field e.g. Compose{T,...} where {T<:AbstractArray,...}).

hakkelt (Contributor, Author) commented Jul 9, 2020

Well, neither of us seems to be able to convince the other. 😄 Unfortunately I don't really have more arguments, and I don't want to waste your time (I really appreciate that you share your concerns about my proposal), so all I can do is try to summarize my pros and cons.

Recommended workflow of GPU computing

  1. Develop an application using generic array functionality, and test it on the CPU with the Array type
  2. Port your application to the GPU by switching to the CuArray type
  3. Disallow the CPU fallback to find operations that are not implemented for, or are incompatible with, GPU execution (usually this means small loops with "scalar indexing")

What is important here is that if you want to make it easy to use this package in GPU code, then the best way to help the programmer is to make the transition from 1 to 2 as easy as possible. My proposal makes the transition easier because it doesn't require any explicit declaration, as it can figure out on its own what type of buffer needs to be allocated. For simple programs this is not a big deal, but for larger programs with many operators it can be quite cumbersome to look up all the constructors.

One code for CPU and GPU

Even if I'm done with the GPU code, it is quite frequent that I need to return to the CPU code. For example, I might want to extend or improve the algorithm, because debugging CPU code is much easier than debugging GPU code. Or I want to work on my laptop (which doesn't have a GPU). Or someone else just wants to check out my code, but they don't have a GPU or don't want the hassle of installing the proper CUDA driver.

If I can use the same code (or almost the same code) for both CPU and GPU, then I save a tremendous amount of effort that would otherwise go into synchronizing the two implementations after each modification.

Shorthand constructors

For me, a very important advantage of using AbstractOperators.jl/FunctionOperators.jl/LinearMaps.jl/LinearOperators.jl is that I can write more elegant code that resembles the mathematical notation used in the literature. And shorthand constructors are one of the most important tools for achieving that. So my problem with the approach you proposed is that it cannot be adapted to shorthand constructors (or at least I cannot see an elegant way to do so).

Main point

From the previous arguments, I hope it is clear why I dislike your proposal with the extra argument in the constructor. I thought about it a bit more, and I came to the conclusion that there might be quite a few other solutions besides mine that would satisfy me, so I'm totally open to alternatives. Just to mention a very simple one: we could have a global boolean flag that decides whether Arrays or CuArrays are used as buffers (although I find this solution particularly ugly).

However, I found one argument for why my approach would be better than the others, and it is thread safety. Let's assume, for example, that we want to use the same operator on multiple threads. I can solve this easily by extending the previously proposed structs:

struct Buffer
    b::Dict{Int, Union{Nothing,Any}}
end

struct Compose{N, M, L<:NTuple{N,Any}, T<:NTuple{M,Buffer}} <: AbstractOperator
    A::L
    buf::T
end

@generated function mul!(y::C, L::Compose{N,M,T1,T2},b::D) where {N,M,T1,T2,C,D}
    ex = quote
        L.buf[1].b[Threads.threadid()] isa Nothing && ( L.buf[1].b[Threads.threadid()] = Array{codomainType(L.A[1])}(undef, size(L.A[1],1)) )
        mul!(L.buf[1].b[Threads.threadid()],L.A[1],b)
    end
    for i = 2:M
    	ex = quote
    	    $ex
            if L.buf[$i].b[Threads.threadid()] isa Nothing
                L.buf[$i].b[Threads.threadid()] = Array{codomainType(L.A[$i])}(undef, size(L.A[$i],1))
            end
            mul!(L.buf[$i].b[Threads.threadid()],L.A[$i], L.buf[$i-1].b[Threads.threadid()])
        end
    end
    ex = quote
    	$ex
        mul!(y,L.A[N], L.buf[M].b[Threads.threadid()])
        return y
    end
end

Thread safety might be achieved in other ways as well, but even besides that, I have the intuition that deferring the allocation gives much more flexibility due to the increased amount of information we can use in mul! (e.g. what the type of the input is, or which thread we are in). And by the way, the example above even eliminates the need for a mutable struct! 😄
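
As a self-contained illustration of that per-thread lookup, the same pattern can be written with get!, so that the first access allocates and later accesses reuse the buffer (the ThreadBuffer and threadbuffer names below are made up for the example):

struct ThreadBuffer
    b::Dict{Int,Any}
end
ThreadBuffer() = ThreadBuffer(Dict{Int,Any}())

# fetch (or lazily allocate) the calling thread's buffer with the requested
# container type, element type and length
function threadbuffer(buf::ThreadBuffer, ::Type{A}, T::Type, n::Int) where {A}
    get!(buf.b, Threads.threadid()) do
        A{T}(undef, n)
    end
end

buf = ThreadBuffer()
y1 = threadbuffer(buf, Array, Float64, 10)   # allocates on first use
y2 = threadbuffer(buf, Array, Float64, 10)   # reuses the same array
y1 === y2                                    # true
# note: concurrent first-time insertions into the Dict would still need a lock;
# this only illustrates the lookup-or-allocate pattern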

Reactions

Ok, so I think we agree on the fact that all buffers in Compose should be either only on the CPU or only on the GPU.

Yes, definitely! Although I previously wrote the following:

By the way, this would mean that once an operator has been applied to GPU arrays, it cannot be used on CPU arrays any more, and vice versa. But it is possible to tackle this problem by modifying the definition of the struct to hold two separate lists of buffers: one for CPU buffers and the other for GPU buffers.

What I meant here is that sometimes such a thing can be handy, for instance when I make the transition from CPU code to GPU code in an interactive session, but actually it would just complicate things, so let's just forget about it! 😄

If it's about sharing buffers it would be preferable to do this during construction. Something like: ...

You are right, sharing buffers can also be achieved with your method. That is why I tried to explain above that my proposal is absolutely not only about sharing buffers.

not sure, but mutable objects are more "dangerous"...

I disagree. I don't see any problems with mutable objects.

for example by adding that keyword argument (and perhaps by giving an extra field e.g. Compose{T,...} where {T<:AbstractArray,...}).

Can you elaborate on this? Can this approach avoid having different constructors in CPU and GPU code?

nantonel (Member) commented Jul 10, 2020

Don't worry, it's good that we discuss this; it's not a waste of time at all. This is just brainstorming, and quite an important one 😉

Perhaps an alternative would be to change the abstract type to AbstractOperator{T} where {T<:AbstractArray}.

This would let us know where each operator lives, either on the GPU or on the CPU. You would then be allowed to combine only operators that live on the same architecture. If you want to switch an operator that lives on the CPU to the GPU, all you have to do is convert that object. A function togpu (and tocpu) could be made for this purpose.

In the case of composition:

togpu(C::Compose{Array,...}) = Compose{CuArray,...}(togpu.(C.A),togpu.(C.buf) )
togpu(C::Compose{CuArray,...}) = C

From the user perspective this would involve writing code like this:

T = Array # here you can set this to CuArray
A = AnOperator(T,...)
B = AnotherOperator(T,...)
C = SomeCalc(A,B)

For operators that can share the same code for GPU and CPU, I believe this should work; you just write:

mul!(y::AbstractArray,L::Operator,::AbstractArray)

and for those that need different implementations:

mul!(y::T,A::Operator{T,...},::T) where {T <: CuArray} = ...
mul!(y::T,A::Operator{T,...},::T) where {T <: Array} = ...

Compose would become:

function Compose(L1::AbstractOperator{T}, L2::AbstractOperator{T}) where {T<:AbstractArray}
    if size(L1,2) != size(L2,1)
    	throw(DimensionMismatch("cannot compose operators"))
    end
    if domainType(L1) != codomainType(L2)
    	throw(DomainError())
    end
    Compose( L1, L2, T{domainType(L1)}(undef,size(L2,1)) )
end

hakkelt (Contributor, Author) commented Jul 11, 2020

Well, I still don't see what's wrong with the deferred allocation, and why we need to complicate our lives and unnecessarily increase the verbosity of the function signatures 😄 (generally, I think more about the inconvenience of using the package later than about the current implementation difficulties). In my opinion, the computing device is just a technical detail (both CPU and GPU give exactly the same result), so the usage of AbstractOperators should not necessarily reflect the computing device. I know it is probably the most common way (most notably in Python packages), but still, I don't like it, especially because in Julia we could avoid it...

Anyway, if I put aside my dislike of the explicitness and the extra parameter, your last proposal looks quite OK to me. 🙂

lostella (Member) commented:

I may have missed some details from the discussion, but here’s my two cents:

The issue is whether input/intermediate/output array types T should be

  1. fixed at construction time, in which case T could be part of the operator type as @nantonel proposes;
  2. decided with the first operator call, in which case T cannot be part of the operator type.

Option 1 is more static, but closer to the current package structure and therefore probably easier to implement: for example, the input/output dimensions are to be specified somehow at construction time (see e.g. the constructors here https://github.com/kul-forbes/AbstractOperators.jl/blob/master/src/linearoperators/DFT.jl).

Option 2 is more dynamic, and I can see how it would be easier to use, but would probably require more work.

It seems to me that in the end, computationally speaking, both options should be equivalent. Even if you have a “slicker” interface as in option 2, the computation will require buffer types and sizes to be known in order to be carried out.

So one question to ask is whether both options can be obtained. In particular, going for option 1 first would be easier as I understand, but one could then allow for “lazier” construction syntax as in option 2: this would just store the operator parameters (not array sizes, not array types), and then “concretize” the operator once called. I’m sorry I don’t have code snippets to showcase this, but in my mind this can be implemented in a fairly modular way, so that one can use both syntaxes as preferred and the code overhead is minimal: the problem I see here is with combination rules, which would need to be implemented for this “lazy layer” too.
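
For illustration, a rough sketch of such a lazy layer could look like the following (every name in it is hypothetical, and nothing like it exists in the package yet):

using AbstractOperators

# wrapper that stores only how to build the operator, and concretizes it on first call
mutable struct LazyOp{F}
    build::F        # x -> concrete operator, e.g. x -> DCT(x)
    concrete::Any   # `nothing` until the first call
end
LazyOp(build) = LazyOp(build, nothing)

function (L::LazyOp)(x)
    # array type and size are taken from the first input, as in option 2
    L.concrete === nothing && (L.concrete = L.build(x))
    return L.concrete * x
end

# usage: D = LazyOp(x -> DCT(x)); D(randn(10)) builds a DCT for ℝ^10 and applies it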

The only thing I’m not a big fan of here is tocpu and togpu, since such a mechanism would need to hard-code the array types to be used on the two devices.

nantonel (Member) commented:

I agree with @lostella: option 1 avoids changing the calculus rules at the expense of changing the AbstractOperator type, while option 2 requires modifying all the calculus rules.

To me, option 1 seems to be more consistent with the current philosophy of this package: static automatic differentiation (AD).

Fully dynamic AD software is already available in Julia, for example Zygote, which takes this dynamism to the compiler level. These packages definitely exploit fully what you can do with Julia. In my opinion, going in that direction with AbstractOperators makes little sense (unfortunately)...

So one question to ask is whether both options can be obtained. In particular, going for option 1 first would be easier as I understand, but one could then allow for “lazier” construction syntax as in option 2: this would just store the operator parameters (not array sizes, not array types), and then “concretize” the operator once called. I’m sorry I don’t have code snippets to showcase this, but in my mind this can be implemented in a fairly modular way, so that one can use both syntaxes as preferred and the code overhead is minimal: the problem I see here is with combination rules, which would need to be implemented for this “lazy layer” too.

I see this possibility as well, but perhaps it would be tidier to have a package on top of AbstractOperator to achieve this: LazyAbstractOperators? 😆 Probably StructuredOptimization could be a suitable candidate as well.

The only thing I’m not a big fan of here is tocpu and togpu, since such a mechanism would need to hard-code the array types to be used on the two devices.

That's quite an important point! But I wonder whether you would ever want to deal with combinations of operators that have different array types... for example, if you use static arrays I'd imagine you want to stick to those as well?

Regarding the verbosity of function signatures, well, you can still keep it clean; for example, you can already build a DCT using the following:

julia> x = randn(10); DCT(x)
ℱc  ℝ^10 -> ℝ^10

julia> x = randn(Complex{Float64},10); DCT(x)
ℱc  ℂ^10 -> ℂ^10

Here, the dimension and domain type are deduced from x. The same could be done for the type of array.
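
For instance, deducing the container type from an example input could rely on similar, which preserves the input's container (the make_buffer helper below is hypothetical):

# allocate a buffer with the same container type as `x` (Array stays Array,
# CuArray stays CuArray), but with a freely chosen element type and size
make_buffer(x::AbstractArray, T::Type, dims...) = similar(x, T, dims...)

x = randn(10)
buf = make_buffer(x, ComplexF64, 20)   # Vector{ComplexF64} on the CPU
# with x::CuArray the same call would return a CuArray{ComplexF64}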

nantonel (Member) commented Jul 12, 2020

By the way, I'm not against this deferred allocation; my only argument was about avoiding mutable structs, and I can see it is not such a strong one! 😅

This is probably fine:

@generated function mul!(y::CuArray, L::Compose{N,M,T1,T2},b::CuArray) where {N,M,T1,T2}
    ex = quote
        L.buf[1] isa Nothing && ( L.buf[1] = CuArray{codomainType(L.A[1])}(undef, size(L.A[1],1)) )
        mul!(L.buf[1],L.A[1],b)
    end
    ...
end

The problem with this is that some operators inside Compose could have buffers of their own. Using the default constructors these will be Arrays, living on the CPU (an example).

In my opinion there must be a strategy to avoid this problem and ensure everything lives on the right architecture. Option 2 fixes these problems. With option 1 I don't see how to avoid this problem without modifying the constructors anyway. Plus, it would be up to the user to carefully build their operators on the right architecture.

hakkelt (Contributor, Author) commented Jul 12, 2020

Hi @lostella,
yes, I think you summarized the question quite well: the two options are computationally equivalent, and it is just a matter of interface and implementation.

To me, option 1 seems to be more consistent with the current philosophy of this package: static automatic differentiation (AD).

Hmm... ok, that's a good point. You guys probably see the "bigger picture" better and know better what the main purpose of creating this package was. 🙂

The only thing I’m not a big fan of here is tocpu and togpu, since such a mechanism would need to hard-code the array types to be used on the two devices.

That's quite an important point! But I wonder whether you would ever want to deal with combinations of operators that have different array types... for example, if you use static arrays I'd imagine you want to stick to those as well?

I agree that the use of tocpu and togpu shouldn't be encouraged. Sometimes the functionality they provide can be useful, so it is worth considering implementing them as auxiliary functions (especially because these functions are easy to implement), but both Option 1 and Option 2 give a better and more intuitive way to define CPU/GPU operators.

The problem with this is that some operators could have buffers of their own. Using the default constructors these will be Arrays, living on the CPU (an example).

Yes, all operators that allocate a buffer (both "atomic" ones like DFT or Filt and the combination rules) need to be modified. For example, the same trick can be used for Filt as in the previous examples for Compose.

So one question to ask is whether both options can be obtained. In particular, going for option 1 first would be easier as I understand, but one could then allow for “lazier” construction syntax as in option 2: this would just store the operator parameters (not array sizes, not array types), and then “concretize” the operator once called. I’m sorry I don’t have code snippets to showcase this, but in my mind this can be implemented in a fairly modular way, so that one can use both syntaxes as preferred and the code overhead is minimal: the problem I see here is with combination rules, which would need to be implemented for this “lazy layer” too.

I see this possibility as well, but perhaps it would be tidier to have a package on top of AbstractOperator to achieve this: LazyAbstractOperators?

I don't think that a dynamic/lazy implementation can be built on top of AbstractOperators, because if we allocate "statically" in the constructors of the base package, then it makes no sense any more to add some dynamic magic on top of it. What I can imagine is a fork of AbstractOperators, but that would be a nightmare to maintain. 😄

Regarding the verbosity of function signatures, well, you can still keep it clean; for example, you can already build a DCT using the following:

Yes, this option would be a comfortable alternative for lazy programmers like me, if we opt for option 1. 😄

hakkelt (Contributor, Author) commented Jul 12, 2020

What I see now is that both sides have quite strong arguments, which can be summarized as follows:

  • Option 1 / static approach / allocation in the constructor (extra parameter in the constructor)
    • Closer to the philosophy of the package
    • Leaves a larger part of the code intact: the extension to GPU is achieved by overloading rather than rewriting functions (if it ain't broken, don't fix it)
    • Advantage of explicitness: users are more aware of the computing device, and it is slightly easier to provide meaningful error messages (easier debugging)
    • No need for mutable structs
  • Option 2 / dynamic approach / deferred allocation
    • Advantage of implicitness: more comfortable to use, turning CPU code into GPU code is a smoother process, and it is also easier to run GPU code on the CPU if no GPU is available (no need to pass around a variable that defines the computing device)
    • Easier thread safety (?)
  • Hybrid solutions
    • a package on top of AbstractOperator (?)
    • a general interface that implements both options 1 and 2 (?)

But now I'm more and more uncertain about which one is better...

nantonel (Member) commented:

For me it's clear that option 1 is more suitable. But certainly coding will clear things up even more (other problems may arise). Let me know if you're planning a PR, otherwise I'll take care of it as soon as I have some time.

hakkelt (Contributor, Author) commented Jul 13, 2020

Ok, then let's stick with option 1. I plan to make a PR, and I will start working on it soon.

nantonel (Member) commented Jul 15, 2020

I've realised that we should avoid modifying the sub-typing of AbstractOperator.

Instead we can introduce a predicate isgpu.

For example:

A = AnOperator(CuArray....)
isgpu(A::AnOperator) = typeof(A.buffer) <: CuArray # for operators with buffers
isgpu(A::AnOperator{T}) = T <: CuArray # for other operators with no buffers 
isgpu(A::AnOperator) = ... # for complete freedom of defining this 

This solves the problem that @lostella pointed out, that is, we will have more freedom in the choice of array types to use.

An example of composition is:

function Compose(L1::AbstractOperator, L2::AbstractOperator)
    if size(L1,2) != size(L2,1)
    	throw(DimensionMismatch("cannot compose operators"))
    end
    if domainType(L1) != codomainType(L2)
    	throw(DomainError())
    end
    if isgpu(L1) != isgpu(L2)
    	throw(ErrorException("Operators are not in the same architecture, use `togpu` or `tocpu` to convert them to the proper one."))
    end
   T = isgpu(L1) ? CuArray : Array
    Compose( L1, L2, T{domainType(L1)}(undef,size(L2,1)) )
end

This is also more consistent with what we've done in the package, for example in the functions domainType and codomainType: initially we had those as subtypes of AbstractOperator, but it turned out to be a bad choice.

(It should also involve less coding)

hakkelt (Contributor, Author) commented Jul 18, 2020

Well, I'm not really a fan of this solution because it looks like patchwork: sometimes the type of the buffer carries the information, and in other cases the computing device is inferred from the type of the operator. I would prefer a more consistent solution.

The other issue I see is that CUDA.jl isn't the only option for using the GPU. OpenCL.jl and AMDGPUnative.jl might also be the object of a future extension, and these packages are incompatible with each other and with CUDA.jl. Therefore, I find it a better solution to implement deviceType (I'm still thinking of a better name) that returns a type instead of a boolean.


For the second problem, there is an easy solution:

A = AnOperator(CuArray....)
deviceType(A::AnOperator) = typeof(A.buffer) # for operators with buffers
deviceType(A::AnOperator{T}) = T # for other operators with no buffers 
deviceType(A::AnOperator) = ... # for complete freedom of defining this 

But I don't know what to do about the first one.

nantonel (Member) commented Jul 19, 2020

It's not really a patch. As I told you, we do something really similar with domainType.

It would still be possible to use other architectures: you just create different packages (which makes sense given that those packages are incompatible with each other).
For CUDA.jl you have AbstractOperatorsCUDA.jl, for OpenCL.jl you have AbstractOperatorOpenCL.jl, etc. This also makes sense because the dependencies of AbstractOperators are kept low, so it remains more of a "core" library.

What these packages will contain are simply extensions of the mul! methods; for example, AbstractOperatorsCUDA.jl exports mul!(y::CuArray,A::Operator,x::CuArray). You can, if needed, redefine some calculus rules there as you like, for example to ensure thread safety, etc.

AbstractOperatorsCUDA.jl will also export something like GPUArrayType() = CuArray , which will be used for example in Compose:

T = iscpu(L1) ? Array : GPUArrayType()
Compose( L1, L2, T{domainType(L1)}(undef,size(L2,1)) )

In AbstractOperators we could simply have iscpu; when iscpu returns false, it means we are using an array on a different architecture, depending on which AbstractOperator$(Extension).jl we imported.
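
A rough skeleton of such an extension package could look like this (everything below is an assumption about how it could be organised, not existing code):

module AbstractOperatorsCUDA

using AbstractOperators, CUDA
import LinearAlgebra: mul!

# array type used by the calculus rules when buffers must live on the GPU
GPUArrayType() = CuArray

# the methods added here would be GPU-specific mul! specializations, e.g. for an
# operator whose generic implementation relies on scalar indexing:
# mul!(y::CuArray, A::SomeOperator, x::CuArray) = ...

end # module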

hakkelt (Contributor, Author) commented Jul 20, 2020

I must admit I also found the solution of domainType not quite elegant, but it might reflect only my personal taste, so I can easily let it go.

On the other hand, I think you cannot avoid adding CUDA.jl as a dependency of the base AbstractOperators.jl package, because the constructors might also have CuArray in their signature. Should AbstractOperatorsCUDA.jl also contain the overloaded constructors? But that would mean that an even larger portion of the codebase would be shared, which might in turn lead to maintainability issues, especially because constructors are prone to later modifications. If we really want to separate CPU and GPU code (including the case where CPU and GPU code are split into two packages), then the only option I see is to opt for the earlier discussed Option 2 / dynamic / "deferred allocation" approach, where the constructors are fully shared between all CPU and GPU solutions.

Also, I don't see why it is better to return a boolean instead of a type. Returning a type would make the code shorter and more readable (we can avoid the conditional operator in your example code, for instance), and it is still easy to check the returned type against the one of interest (e.g. Array) to get a boolean value.
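
To illustrate the point, with a hypothetical type-returning deviceType the boolean check and the allocation would read as follows (a sketch only; none of this exists in the package):

using CUDA   # assumption: CUDA.jl provides the GPU array type CuArray

# `deviceType` would return the container type an operator works with,
# e.g. Array for a CPU operator and CuArray for a GPU one:
# deviceType(A::SomeOperator) = ...

# a boolean query is then a one-liner on top of it:
isgpu(A) = deviceType(A) <: CuArray

# and the allocation in Compose needs no conditional:
#   T = deviceType(L1)
#   Compose(L1, L2, T{domainType(L1)}(undef, size(L2, 1)))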

nantonel (Member) commented Jul 22, 2020

I must admit I also found the solution of domainType not quite elegant, but it might reflect only my personal taste, so I can easily let it go.

Perhaps it's not elegant, but it simplifies a lot of things. In my experience, avoiding sub-typing in Julia when you can gives you more freedom.

On the other hand, I think you cannot avoid adding CUDA.jl as a dependency of the base AbstractOperators.jl package, because the constructors might also have CuArray in their signature. Should AbstractOperatorsCUDA.jl also contain the overloaded constructors?

Yes that was my idea. Only the ones that are necessary, i.e. those that don't work with AbstractArray.

But that would mean that an even larger portion of the codebase would be shared, which might in turn lead to maintainability issues, especially because constructors are prone to later modifications.

I don't see how that differs from having everything in a single package; I only see advantages (version tagging can be used to make sure things work when one package advances with respect to another). Anyway, that was just to reply to your concern about being able to use different GPU architectures/packages.

If we really want to separate CPU and GPU code (including the case where CPU and GPU code are split into two packages), then the only option I see is to opt for the earlier discussed Option 2 / dynamic / "deferred allocation" approach, where the constructors are fully shared between all CPU and GPU solutions.

If you think that's the best approach, go for it. 😄

Also, I don't see why it is better to return a boolean instead of a type. Returning a type would make the code shorter and more readable (we can avoid the conditional operator in your example code, for instance), and it is still easy to check the returned type against the one of interest (e.g. Array) to get a boolean value.

This was to address what @lostella said:

The only thing I’m not a big fan of here is tocpu and togpu, since such a mechanism would need to hard-code the array types to be used on the two devices.

With the boolean this hard-coding is avoided.

hakkelt mentioned this issue Apr 10, 2024
nantonel (Member) commented:

@hakkelt Thanks for that CR! Do you think we can close this?

hakkelt (Contributor, Author) commented Apr 14, 2024

Yes, I think we are done with that issue.

hakkelt closed this as completed Apr 14, 2024