
opencl training fail #30

Open
SolarPeng opened this issue Jun 1, 2016 · 10 comments · Fixed by #49

@SolarPeng

I have never been successful at training.

th train.lua --opencl --dataset 50000 --hiddenSize 1000

-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
Vocabulary size: 25931
Examples: 83632
libthclnn_searchpath /Users/SolarKing/Dev/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: GeForce 9400M

-- Epoch 1 / 50

/Users/SolarKing/Dev/torch/install/bin/luajit: ...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
bad argument #3 to '?' (number expected, got nil)
stack traceback:
[C]: at 0x0ebe4500
[C]: in function '__newindex'
.../Dev/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function <.../Dev/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:99>
[C]: in function 'xpcall'
...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...arKing/Dev/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./seq2seq.lua:71: in function 'train'
train.lua:85: in main chunk
[C]: in function 'dofile'
.../Dev/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x010e8bbbb0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
stack traceback:
[C]: in function 'error'
...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
...arKing/Dev/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./seq2seq.lua:71: in function 'train'
train.lua:85: in main chunk
[C]: in function 'dofile'
.../Dev/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x010e8bbbb0

@lfuelling (Contributor) commented Jun 20, 2016

I'm also using torch-cl. Following the tutorial there, you shouldn't install nn, cudnn, cldnn, etc. because they break the installation. The only things I installed were rnn and penlight.

Got something similar:

lerk@blrg:~/workspace/neuralconvo$ th train.lua --opencl
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /home/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce GTX 660

-- Epoch 1 / 50

/home/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x7f142d00baa0
    [C]: in function '__newindex'
    ...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    /home/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:66: in function 'train'
    train.lua:88: in main chunk
    [C]: in function 'dofile'
    ...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405e90

UPDATE: I also tried this on my MacBook Pro, same error:

lerk@blackreach ~/workspace/neuralconvo                                                                                                         [14:33:45]
> $ th train.lua --opencl                                                                                                                      [±master ✓]
-- Loading dataset
data/vocab.t7 not found
-- Parsing Cornell movie dialogs data set ...
 [=============================================================== 387810/387810 =======>] Tot: 1s615ms | Step: 0ms
-- Pre-processing data
 [================================================================ 166194/166194 ======>] Tot: 31s885ms | Step: 0ms
-- Removing low frequency words
 [================================================================ 221282/221282 ======>] Tot: 14s809ms | Step: 0ms
Writing data/examples.t7 ...
 [=============================================================== 221282/221282 =======>] Tot: 33s43ms | Step: 0ms
Writing data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /Users/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: ATI Radeon HD 6770M

-- Epoch 1 / 50

/Users/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x05350f40
    [C]: in function '__newindex'
    ...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    ...rs/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:66: in function 'train'
    train.lua:88: in main chunk
    [C]: in function 'dofile'
    ...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x0105008d00

@lfuelling (Contributor) commented Jun 20, 2016

The stack trace suggests that the error is on this line:

local encoderOutput = self.encoder:forward(encoderInputs)

I tried to locate the error and I think it's on line 88 in train.lua:

model:train(encInputs, decInputs, decTargets)

Could it be that #29 introduced this? It's the latest change to this line. Previously it was:

local err = model:train(input, target)

I'll try to fix this somehow (I don't even know Lua) and get back here.
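One way to narrow this down might be a guard before the forward call. This is a hedged sketch, not code from the repo: `encoderInputs` and `self.encoder` are the names from the seq2seq.lua line quoted above, and the checks themselves are hypothetical debugging aids.

```lua
-- Hypothetical debugging aid for seq2seq.lua: confirm the encoder input
-- is a real tensor before calling forward(), since "number expected,
-- got nil" suggests a nil reaching clnn's LookupTable.
assert(encoderInputs ~= nil, 'encoderInputs is nil')
print('encoderInputs type:', torch.type(encoderInputs))
print('encoderInputs size:', encoderInputs:size())
local encoderOutput = self.encoder:forward(encoderInputs)
```

If the assertion fires, the bug is upstream in how the batch is built, not inside clnn.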

UPDATE: I checked out the last commit before the merge and I got the same error again. Only the hex numbers differ:

lerk@blrg:~/workspace/neuralconvo$ th train.lua --opencl
-- Loading dataset
data/vocab.t7 not found
-- Parsing Cornell movie dialogs data set ...
 [=============================================================== 387810/387810 =======>] Tot: 1s391ms | Step: 0ms
-- Pre-processing data
 [============================================================= 166194/166194 =========>] Tot: 5m14s | Step: 0ms
-- Removing low frequency words
 [============================================================ 221282/221282 ==========>] Tot: 7m6s | Step: 1ms
Writing data/examples.t7 ...
 [============================================================ 221282/221282 ==========>] Tot: 7m4s | Step: 5ms
Writing data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /home/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce GTX 660

-- Epoch 1 / 50

/home/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x7fcd865b3aa0
    [C]: in function '__newindex'
    ...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    /home/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:66: in function 'train'
    train.lua:88: in main chunk
    [C]: in function 'dofile'
    ...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405e90

@mgomes commented Jul 7, 2016

I am hitting this as well. I think something changed in the last month or so where the cltorch and clnn modules are no longer supported via luarocks. Instead you have to use the torch-cl distro.

The problem is coming from train.lua:70 in model:getParameters(). That no longer returns the parameters. I'm still looking into it.

@lfuelling (Contributor) commented Aug 18, 2016

@mgomes you got anything so far?

I tried this again and stumbled upon the following part in the official torch source:

function optim.adam(opfunc, x, config, state)
   -- (0) get/update state
   local config = config or {}
   local state = state or config
   local lr = config.learningRate or 0.001
   local lrd = config.learningRateDecay or 0

   local beta1 = config.beta1 or 0.9
   local beta2 = config.beta2 or 0.999
   local epsilon = config.epsilon or 1e-8

In the stack trace, epsilon is reported as a nil value where a number is expected. I assume that in cltorch (distro-cl) there is no default value for this, but I am unable to find the file in cltorch.

The config object that gets passed to the function above is the following:

{
  momentum : 0.9
  learningRate : 0.001
}
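If the missing epsilon default really is the culprit, one hedged workaround would be to pass every Adam hyperparameter explicitly so none can come through as nil. The values below are just optim.adam's documented defaults; whether train.lua accepts this config unchanged is an assumption.

```lua
-- Hypothetical workaround: spell out all Adam hyperparameters explicitly
-- so optim.adam never has to fall back on a (possibly missing) default.
local optimConfig = {
   learningRate = 0.001,
   beta1   = 0.9,    -- first-moment decay
   beta2   = 0.999,  -- second-moment decay
   epsilon = 1e-8,   -- numerical-stability term
}
-- then: optim.adam(feval, params, optimConfig)
```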

Here's another stacktrace:

> $ th train.lua --opencl                                                                                           [±master ●]
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /Users/lfuelling/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: Iris Pro

-- Epoch 1 / 50  (LR= 0.001)

{
  momentum : 0.9
  learningRate : 0.001
}
/Users/lfuelling/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x01deef20
    [C]: in function '__newindex'
    ...ling/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    ...uelling/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    train.lua:93: in function 'opfunc'
    .../lfuelling/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
    train.lua:129: in main chunk
    [C]: in function 'dofile'
    ...g/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x0101aa6cf0

UPDATE: I'm stupid. If you read the stack trace, you'll notice Using OpenCL device: Iris Pro. I bet it works when I use the external GPU. neural-style has an option to set the GPU you want; I'll try to implement this.
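For reference, a minimal device-selection sketch might look like the following. This assumes cltorch exposes setDevice/getDeviceCount (as its README describes); the environment variable and the 1-based index choice are purely illustrative.

```lua
-- Hypothetical device-selection sketch for train.lua, mirroring
-- neural-style's GPU option. Device indices in cltorch are 1-based.
require 'cltorch'
local deviceIndex = tonumber(os.getenv('CL_DEVICE') or '1')
print('OpenCL devices available:', cltorch.getDeviceCount())
cltorch.setDevice(deviceIndex)  -- e.g. pick the discrete GPU, not Iris Pro
```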

@hughperkins

Ok, I fixed a bunch of bugs yesterday. I think the easiest thing to do will be to simply reinstall distro-cl, since there were a bunch of fixes. Specifically, rnn is now pinned via rocks-cl, which implies a change to your torch-cl/install/etc/luarocks/config.lua file to add one additional rocks_server, as follows:

rocks_servers = {
   [[https://raw.githubusercontent.com/hughperkins/rocks-cl/master]],
   [[https://raw.githubusercontent.com/torch/rocks/master]],
   [[https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master]]
}

There was also a change to the exe/luajit-rocks submodule, to point to https://github.com/hughperkins/luajit-rocks , to hold this configuration.

I just now tested a full fresh reinstallation, using the following commands:

git clone --recursive https://github.com/hughperkins/distro-cl torch-cl
cd torch-cl
bash install.sh -b
source /data/torch-cl/install/bin/torch-activate  # normally this would be ~/torch-cl/... for you
luarocks install rnn
luarocks install torchx
cd /data/git/neuralconvo
bfboost client -r th train.lua --opencl   # you won't need/want the `bfboost client -r` bit; this is just because I'm running on bfboost
# et voila, running, see screenshot

Screenshot:
[screenshot: neuralconvo3, showing training running]

@lfuelling (Contributor)
For those too lazy to read the file: -b doesn't prompt for anything. Watch your .whateverrc after the install to remove duplicate entries of torch-activate.

@hughperkins

It's not working yet... I'm still trying to fix it. I got as far as maskedSelect being implemented, but it currently causes a segfault under the present scenario, which I need to look into. I think you might as well leave this open for now, really.

@lfuelling (Contributor)

I think it was automatically closed. Ping @macournoyer

@macournoyer macournoyer reopened this Aug 23, 2016
@macournoyer (Owner)

Ooops! Autoclosed indeed.

@hughperkins

Might be working now. Can you pull down latest updates to distro-cl, and retry?
