
EvoTrees consumes too much memory and crashes for sparse matrices #87

Closed

pgagarinov opened this issue Mar 31, 2021 · 7 comments

@pgagarinov commented Mar 31, 2021

Let us consider a real ML problem (I cannot show the feature names, but each row below is a separate feature) with 600k observations:

[image: table of features with the number of categories per feature]

As we can see, there are many categorical features, with approximately 1500 categories per feature on average.

Let us approximate the real features with random one-hot encoded features represented as sparse matrices, and try to apply both XGBoost and EvoTrees:

[image: notebook output showing XGBoost training successfully while EvoTrees runs out of memory]

As we can see, XGBoost handles very sparse data well, while EvoTrees allocates too much memory and crashes with an OOM error. (With 50 features × 1500 categories, the one-hot design matrix has 75,000 columns, so any dense 600,000 × 75,000 representation is far beyond typical RAM.)

The notebook is attached
julia_evotrees.zip

The code to reproduce as a plain script:

# -*- coding: utf-8 -*-
# ---
# jupyter:
#   jupytext:
#     formats: ipynb,jl:light
#     text_representation:
#       extension: .jl
#       format_name: light
#       format_version: '1.5'
#       jupytext_version: 1.11.1
#   kernelspec:
#     display_name: Julia 1.6.0-rc3
#     language: julia
#     name: julia-1.6
# ---

# + tags=[]
using Pkg; Pkg.activate(".");

# + tags=[]
Pkg.add(["Statistics", "StatsBase", "XGBoost", "Revise", "EvoTrees", "BenchmarkTools", "CUDA", "SparseArrays"]);

# + tags=[]
using Statistics
using StatsBase
using XGBoost
using Revise
using EvoTrees
using BenchmarkTools
using CUDA
using SparseArrays

# + tags=[]
nrounds = 200;
nthread = Threads.nthreads();

# + tags=[]
# xgboost params
params_xgb = ["max_depth" => 5,
         "eta" => 0.05,
         "objective" => "reg:squarederror",
         "print_every_n" => 5,
         "subsample" => 0.5,
         "colsample_bytree" => 0.5,
         "tree_method" => "hist",
         "max_bin" => 64]
metrics = ["rmse"]

# + tags=[]
# EvoTrees params
params_evo = EvoTreeRegressor(T=Float32,
        loss=:linear, metric=:mse,
        nrounds=nrounds, α=0.5,
        λ=0.0, γ=0.0, η=0.05,
        max_depth=6, min_weight=1.0,
        rowsample=0.5, colsample=0.5, nbins=64)

# + tags=[]
# Select n distinct integers from 0:K-1 uniformly at random (returned in increasing order).
function random_select(n::Int64, K::Int64)
    @assert 0<=n<=K

    sample=Vector{Int64}(undef, n)
    t=Int64(0)
    m=Int64(0)

    while m<n
        if (K-t)*rand()>=n-m
            t+=1
        else
            m+=1
            sample[m]=t
            t+=1
        end
    end
    sample
end


# + tags=[]
# Build an N×M sparse matrix with exactly n ones placed uniformly at random.
function create_sparseMatrix(n::Int64, N::Int64, M::Int64)
    @assert (0<=N)&&(0<=M)
    @assert 0<=n<=N*M

    nonZero = random_select(n,N*M)

    # column major: k=i+j*N
    I = map(k->mod(k,N),nonZero)
    J = map(k->div(k,N),nonZero)

    sparse(I.+1,J.+1,ones(n),N,M)
end
# -

# Same idea using StatsBase.sample: an N×M sparse matrix with K randomly placed ones.
sparseones(N,M,K) = sparse(
  (x->(first.(x).+1,last.(x).+1))(divrem.(sample(0:N*M-1,K,replace=false),M))...,
  ones(K),N,M
)

# + tags=[]
nobs = Int(600000)
num_feat = Int(50);
n_cats_per_feature = Int(1500);
@info "testing with: $nobs observations | $num_feat features."

# + tags=[]
X = sparseones(nobs, num_feat*n_cats_per_feature, Int64(nobs*num_feat));

# + tags=[]
Y = rand(size(X, 1));

# + tags=[]
@info "xgboost train:"
@time m_xgb = xgboost(X, nrounds, label=Y, param=params_xgb, metrics=metrics, nthread=nthread, silent=1);
# @btime xgboost($X, $nrounds, label=$Y, param=$params_xgb, metrics=$metrics, silent=1);

# + tags=[]
@info "xgboost predict:"
@time pred_xgb = XGBoost.predict(m_xgb, X);
# @btime XGBoost.predict($m_xgb, $X);

# + tags=[]
@info "evotrees train CPU:"
params_evo.device = "cpu"
@time m_evo = fit_evotree(params_evo, X, Y);
# @btime fit_evotree($params_evo, $X, $Y);
# -

@info "evotrees predict CPU:"
@time pred_evo = EvoTrees.predict(m_evo, X);
#@btime EvoTrees.predict($m_evo, $X);

# + tags=[]
CUDA.allowscalar(false)
@info "evotrees train GPU:"
params_evo.device = "gpu"
@time m_evo_gpu = fit_evotree(params_evo, X, Y);
#@btime fit_evotree($params_evo, $X, $Y);

# + tags=[]
@info "evotrees predict GPU:"
@time pred_evo = EvoTrees.predict(m_evo_gpu, X);
#@btime EvoTrees.predict($m_evo_gpu, $X);
# -


@jeremiedb (Member) commented Mar 31, 2021

EvoTrees is effectively designed around dense matrices, so the poor performance on sparse matrices is unfortunately expected.
The library was initially built around my use cases, which didn't involve sparse features.

For clarification, my understanding is that for your use case the data is sparse as a result of one-hot encoding categorical features with a large number of modalities, as opposed to being intrinsically sparse from a large number of different features that are sparsely populated. Is that correct?

Although the latter case appears trickier to adapt to, I think support for categorical features would be a reasonable feature to add. It would involve passing such features as integers (a compact "one-cold" representation as opposed to "one-hot"). A treatment similar to what CatBoost/LightGBM do would then be applied; see the sketch below.
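To illustrate the "one-cold" idea (just a toy sketch, not anything EvoTrees supports today), a categorical column would be passed as one integer code per row instead of a block of one-hot indicator columns:

```julia
using CategoricalArrays, SparseArrays

# Toy example of "one-cold" vs "one-hot" for a single categorical feature.
raw = ["red", "green", "blue", "green", "red"]

cv = categorical(raw)        # CategoricalVector with levels ["blue", "green", "red"]
codes = levelcode.(cv)       # one-cold: one Int per row -> [3, 2, 1, 2, 3]

# The equivalent one-hot representation is a 5×3 sparse indicator matrix;
# with ~1500 modalities per feature, this is what blows up the column count.
onehot = sparse(1:length(codes), codes, ones(length(codes)),
                length(codes), length(levels(cv)))
```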

This won't be solved by today though :) Suggestions/comments on the proper approach are welcome!

@ablaom (Contributor) commented Mar 31, 2021

> A treatment similar to what CatBoost/LightGBM do would then be applied.

Can you point out a document explaining how these algorithms handle "categorical" features?

@jeremiedb (Member) commented Apr 1, 2021

Sure, here are two docs about it for CatBoost:
https://arxiv.org/pdf/1706.09516.pdf
http://learningsys.org/nips17/assets/papers/paper_11.pdf

For LightGBM: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features, though I don't have access to the referenced 1958 paper :)

There are a couple of ways around it. One is to simply treat the category index as a numeric feature. This obviously results in arbitrary groupings of categories, but can nonetheless be valuable. Otherwise, the main idea is to form groupings based on target-related metrics: for example, a mapping of the index into a numeric feature based on a credibility-adjusted observed target (so as to limit the overfitting otherwise associated with a straight average of the target).
CatBoost has a nice explainer on their methodology here: https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html

So the lazy workaround of using the category encoding as a numeric input can already be applied right now. As for fancier handling of the groupings, the credibility-adjusted / regularized target values appear to me as the lowest-hanging fruit (similar to the CatBoost explainer), though I'm unclear whether some alternative approach has been proven worth the implementation effort. A minimal sketch of that idea is below.
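For concreteness, here is a minimal sketch of such a smoothed target-mean encoding, shrinking rare levels toward the global mean (this is the plain, non-ordered variant; CatBoost's actual scheme uses permutation-ordered statistics, and all names below are made up for illustration):

```julia
using Statistics

# enc(c) = (sum of y over level c + w * global mean) / (count of level c + w)
# Levels with few observations are pulled toward the global mean, limiting overfit.
function smoothed_target_encode(cats::AbstractVector, y::AbstractVector; w::Real=10)
    μ = mean(y)
    sums = Dict{eltype(cats),Float64}()
    counts = Dict{eltype(cats),Int}()
    for (c, yi) in zip(cats, y)
        sums[c] = get(sums, c, 0.0) + yi
        counts[c] = get(counts, c, 0) + 1
    end
    enc = Dict(c => (sums[c] + w * μ) / (counts[c] + w) for c in keys(sums))
    return [enc[c] for c in cats], enc
end

cats = rand(string.(1:1500), 10_000)   # one high-cardinality categorical column
y = rand(10_000)
y_enc, mapping = smoothed_target_encode(cats, y; w=20)
```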

@pgagarinov (Author) commented Apr 1, 2021

> For clarification, my understanding is that for your use case the data is sparse as a result of one-hot encoding categorical features with a large number of modalities, as opposed to being intrinsically sparse from a large number of different features that are sparsely populated. Is that correct?

Yes, that is correct.

> This won't be solved by today though :) Suggestions/comments on the proper approach are welcome!

I bet!
As for comments on the proper approach: I would implement the CatBoost encoder from https://contrib.scikit-learn.org/category_encoders/catboost.html in Julia. See also https://catboost.ai/docs/concepts/educational-materials-papers.html.

> It would involve passing such features as integers (a compact "one-cold" representation as opposed to "one-hot")

I totally agree with this approach: either integers or CategoricalArrays. The rest of the transformations should happen inside the GBM library itself, unlike in XGBoost.

@pgagarinov (Author) commented

More articles on the subject:

https://catboost.ai/docs/concepts/educational-materials-papers.html
https://towardsdatascience.com/categorical-features-parameters-in-catboost-4ebd1326bee5

We could use https://contrib.scikit-learn.org/category_encoders/_modules/category_encoders/cat_boost.html#CatBoostEncoder as a reference implementation.

This package (https://contrib.scikit-learn.org/category_encoders/index.html) contains other useful categorical feature encoders that are currently missing in MLJ. For now, I ended up calling those encoders from Julia via PyCall, roughly as sketched below.
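Roughly, that PyCall route looks like this (assuming category_encoders is installed in the Python environment PyCall points to; the data and names are placeholders, and the exact return-value conversion may need tweaking):

```julia
using PyCall

ce = pyimport("category_encoders")

# Toy data: one high-cardinality categorical column of string labels and a numeric target.
X_cat = reshape(rand(string.(1:1500), 10_000), :, 1)
y = rand(10_000)

enc = ce.CatBoostEncoder()
# fit_transform follows the sklearn API; the result comes back as a pandas DataFrame,
# and .values pulls out the underlying numeric array.
X_enc = enc.fit_transform(X_cat, y).values
```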

@ablaom (Contributor) commented Apr 14, 2021

Thanks @pgagarinov. Cross referenced here: JuliaAI/MLJModels.jl#375

@jeremiedb (Member)

Given the improvements to the memory footprint throughout the various releases since this issue was opened, and the added support for categorical data through the new Tables API since v0.16, I'll close this issue. Don't hesitate to reopen if the mentioned concerns are still a problem.
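For anyone landing here later, a rough sketch of what the Tables-based route looks like (keyword names follow my reading of the recent EvoTrees docs and may differ between versions, so double-check against the docs for the release you're on):

```julia
using DataFrames, CategoricalArrays, EvoTrees

# Categorical features are passed as a categorical column, not as one-hot columns.
df = DataFrame(
    x_num = rand(1_000),
    x_cat = categorical(rand(string.(1:1500), 1_000)),
    y     = rand(1_000),
)

config = EvoTreeRegressor(nrounds=100, eta=0.05, max_depth=6, nbins=64)
m = fit_evotree(config, df; target_name="y", fnames=["x_num", "x_cat"])
pred = m(df)
```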
