Different error when running test: Undefined symbol: _ZN10tensorflow3PadERKN5Eigen9GpuDeviceEPKfiiiiiiPf #93

fperezgamonal opened this issue Apr 22, 2019 · 10 comments


fperezgamonal commented Apr 22, 2019

Hello all,

After successfully compiling the code by addressing some problems with the help of issues #76, #65 and #28, when I run

python -m src.flownet2.test --input_a data/samples/0img0.ppm --input_b data/samples/0img1.ppm --out ./

the following error is reported:

  File "/soft/easybuild/debian/8.8/Broadwell/software/Tensorflow-gpu/1.10.0-foss-2017a-Python-3.6.4/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /homedtic/fperez/Documents/Papers_code/TFM/state_of_the_art/DL/FlowNet2/flownet2-tf/src/./ops/build/correlation.so: undefined symbol: _ZN10tensorflow3PadERKN5Eigen9GpuDeviceEPKfiiiiiiPf

I have tried the proposed solutions for undefined-symbol errors (issues #8, #41 and #87) without success. I have noticed that the undefined symbol is different from the one in any other post on this repository, so I checked the tensorflow repository for similar errors and only found one issue that suggests recompiling without GPU support and adding the "-c" flag, but I do not see how that applies to my case (and compiling for CPU only would make training and inference very slow...).
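For reference, the mangled name can be decoded with c++filt, and it points at the GPU Pad helper (presumably the one src/ops/correlation/pad.cu.cc is supposed to provide):

echo '_ZN10tensorflow3PadERKN5Eigen9GpuDeviceEPKfiiiiiiPf' | c++filt
# tensorflow::Pad(Eigen::GpuDevice const&, float const*, int, int, int, int, int, int, float*)

So correlation.so was built expecting a GPU implementation of Pad that never made it into the linked objects, which would be consistent with the GPU code path not being compiled in (see the note below about removing -DGOOGLE_CUDA=1).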

Makefile

My Makefile looks as follows:

# Makefile

$(info    CUDA_HOME is $(CUDA_HOME))
TF_INC = `python -c "import tensorflow; print(tensorflow.sysconfig.get_include())"`
TF_LIB = `python -c "import tensorflow; print(tensorflow.sysconfig.get_lib())"`
#ifndef CUDA_HOME
#    CUDA_HOME := /usr/local/cuda
#endif
#CUDA_HOME_C=${CUDA_HOME}

CC        = gcc -O2 -pthread
CXX       = g++
GPUCC     = nvcc --expt-relaxed-constexpr
CFLAGS    = -std=c++11 -I$(TF_INC) -I"$(CUDA_HOME)/include" -DNDEBUG -D_GLIBCXX_USE_CXX11_ABI=0
GPUCFLAGS = -c
LFLAGS    = -pthread -shared -fPIC
GPULFLAGS = -x cu -Xcompiler -fPIC
CGPUFLAGS = -L$(CUDA_HOME)/lib -L$(CUDA_HOME)/lib64 -lcudart -L$(TF_LIB) -ltensorflow_framework

OUT_DIR   = src/ops/build
PREPROCESSING_SRC = "src/ops/preprocessing/preprocessing.cc" "src/ops/preprocessing/kernels/flow_augmentation.cc" "src/ops/preprocessing/kernels/augmentation_base.cc" "src/ops/preprocessing/kernels/data_augmentation.cc"
GPU_SRC_DATA_AUG        = src/ops/preprocessing/kernels/data_augmentation.cu.cc
GPU_SRC_FLOW            = src/ops/preprocessing/kernels/flow_augmentation_gpu.cu.cc
GPU_PROD_DATA_AUG       = $(OUT_DIR)/data_augmentation.o
GPU_PROD_FLOW           = $(OUT_DIR)/flow_augmentation_gpu.o
PREPROCESSING_PROD      = $(OUT_DIR)/preprocessing.so

DOWNSAMPLE_SRC = "src/ops/downsample/downsample_kernel.cc" "src/ops/downsample/downsample_op.cc"
GPU_SRC_DOWNSAMPLE  = src/ops/downsample/downsample_kernel_gpu.cu.cc
GPU_PROD_DOWNSAMPLE = $(OUT_DIR)/downsample_kernel_gpu.o
DOWNSAMPLE_PROD         = $(OUT_DIR)/downsample.so

CORRELATION_SRC = "src/ops/correlation/correlation_kernel.cc" "src/ops/correlation/correlation_grad_kernel.cc" "src/ops/correlation/correlation_op.cc"
GPU_SRC_CORRELATION  = src/ops/correlation/correlation_kernel.cu.cc
GPU_SRC_CORRELATION_GRAD  = src/ops/correlation/correlation_grad_kernel.cu.cc
GPU_SRC_PAD = src/ops/correlation/pad.cu.cc
GPU_PROD_CORRELATION = $(OUT_DIR)/correlation_kernel_gpu.o
GPU_PROD_CORRELATION_GRAD = $(OUT_DIR)/correlation_grad_kernel_gpu.o
GPU_PROD_PAD = $(OUT_DIR)/correlation_pad_gpu.o
CORRELATION_PROD        = $(OUT_DIR)/correlation.so

FLOWWARP_SRC = "src/ops/flow_warp/flow_warp_op.cc" "src/ops/flow_warp/flow_warp.cc" "src/ops/flow_warp/flow_warp_grad.cc"
GPU_SRC_FLOWWARP = "src/ops/flow_warp/flow_warp.cu.cc"
GPU_SRC_FLOWWARP_GRAD = "src/ops/flow_warp/flow_warp_grad.cu.cc"
GPU_PROD_FLOWWARP = "$(OUT_DIR)/flow_warp_gpu.o"
GPU_PROD_FLOWWARP_GRAD = "$(OUT_DIR)/flow_warp_grad_gpu.o"
FLOWWARP_PROD = "$(OUT_DIR)/flow_warp.so"

ifeq ($(OS),Windows_NT)
    detected_OS := Windows
else
    detected_OS := $(shell sh -c 'uname -s 2>/dev/null || echo not')
endif
ifeq ($(detected_OS),Darwin)  # Mac OS X
        CGPUFLAGS += -undefined dynamic_lookup
endif
ifeq ($(detected_OS),Linux)
        CFLAGS += -D_MWAITXINTRIN_H_INCLUDED -D_FORCE_INLINES -D__STRICT_ANSI__ -D_GLIBCXX_USE_CXX11_ABI=0
endif

all: preprocessing downsample correlation flowwarp

preprocessing:
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_DATA_AUG) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_DATA_AUG)
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_FLOW) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_FLOW)
        $(CXX) -g $(CFLAGS)  $(PREPROCESSING_SRC) $(GPU_PROD_DATA_AUG) $(GPU_PROD_FLOW) $(LFLAGS) $(CGPUFLAGS) -o $(PREPROCESSING_PROD)

downsample:
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_DOWNSAMPLE) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_DOWNSAMPLE)
        $(CXX) -g $(CFLAGS)  $(DOWNSAMPLE_SRC) $(GPU_PROD_DOWNSAMPLE) $(LFLAGS) $(CGPUFLAGS) -o $(DOWNSAMPLE_PROD)

correlation:
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_CORRELATION) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_CORRELATION)
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_CORRELATION_GRAD) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_CORRELATION_GRAD)
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_PAD) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_PAD)
        $(CXX) -g $(CFLAGS)  $(CORRELATION_SRC) $(GPU_PROD_CORRELATION) $(GPU_PROD_CORRELATION_GRAD) $(GPU_PROD_PAD) $(LFLAGS) $(CGPUFLAGS) -o $(CORRELATION_PROD)

flowwarp:
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_FLOWWARP) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_FLOWWARP)
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_FLOWWARP_GRAD) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_FLOWWARP_GRAD)
        $(CXX) -g $(CFLAGS)  $(FLOWWARP_SRC) $(GPU_PROD_FLOWWARP) $(GPU_PROD_FLOWWARP_GRAD) $(LFLAGS) $(CGPUFLAGS) -o $(FLOWWARP_PROD)

clean:
        rm -f $(PREPROCESSING_PROD) $(GPU_PROD_FLOW) $(GPU_PROD_DATA_AUG) $(DOWNSAMPLE_PROD) $(GPU_PROD_DOWNSAMPLE)
                                                                                                                        

Environment

I am working remotely in a cluster (SLURM-based, loading modules instead of installing packages, etc.) with the following characteristics:

  • OS: Debian GNU/Linux 8 (jessie)
I have loaded the following versions of the required libraries (numpy, scipy, etc. are included with the Python module):
  • Tensorflow GPU: 1.10.0
  • Python 3.6.4
  • Tkinter 3.6.4
  • pypng 0.0.19
  • GCC 6.3.0-2.27

I have tried other versions of tensorflow-gpu (1.5.0 and 1.12.0) with the same results.
One thing I have noticed is that on the cluster, inside CUDA_HOME, there is no lib folder, only lib64.

As mentioned above, I have tried a combination of the proposed solutions without success and I am now running out of ideas, although I fear it is related to working on a cluster and loading modules (I had to remove -DGOOGLE_CUDA=1 in order to compile successfully, as suggested by the cluster's technical staff).

Additionally, if I remove -DNDEBUG -D_GLIBCXX_USE_CXX11_ABI=0 from the flags, the same error arises after successful compilation.
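For reference, a quick way to double-check which compile and link flags the installed TensorFlow wheel actually expects (available since TF 1.4) is:

python -c "import tensorflow as tf; print(tf.sysconfig.get_compile_flags()); print(tf.sysconfig.get_link_flags())"

The first list contains the include path and the -D_GLIBCXX_USE_CXX11_ABI value the wheel was built with; the second contains the library path and -ltensorflow_framework, which should match what the Makefile passes.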

Thanks for your time! Any help would be greatly appreciated. I'll keep this post updated if I try anything different.

Cheers,
Ferran.

UPDATE: since I did not find many reports of this error mentioning "GpuDevice", I am currently trying to include -DGOOGLE_CUDA=1 again, because I think the former error is related to the build not finding any GPU support. Now I get "cuda/include/cuda.h: No such file or directory" as in issue #45, but the resolution there does not fix my problem. I will keep investigating, since solutions like editing the header that produces the error are not possible: I am working on a cluster without write access to such files.

@fperezgamonal (Author)

Final update: after fighting with it for quite a few days and with help from my university's IT staff, I got it solved. A soft link for cuda.h was the solution (keeping the Makefile as shown above, if I am not mistaken).

I will close this issue now; feel free to reopen it if you encounter a similar problem and I'll try to help you as much as possible.

Cheers.


seni04 commented Aug 12, 2019

Final update: after fighting with it for quite a few days and with help from my university's IT staff, I got it solved. A soft link for cuda.h was the solution (keeping the Makefile as shown above, if I am not mistaken).

I will close this issue now; feel free to reopen it if you encounter a similar problem and I'll try to help you as much as possible.

Cheers.

Hello sir, what do you mean by "A soft link for cuda.h was the solution"?

How did you do it?

fperezgamonal reopened this Aug 12, 2019

fperezgamonal commented Aug 12, 2019

Hello @seni04, the technical staff told me they had fixed it by creating a soft link between the actual CUDA version on the machine and the "standard" path where it is normally installed.

I assume they did something like:

ln -s /usr/bin/cuda-10.0 /usr/bin/cuda
But using the actual path where you installed CUDA as the first argument.
I'm sorry I cannot give you more details but I've just checked my IT tickets and found no extra details.
I hope this helps you,
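In case it is useful, here is a sketch of the kind of link that makes the failing "cuda/include/cuda.h" include resolve on a standard install (example paths only, not the exact commands from my cluster):

# create a link named "cuda" inside TensorFlow's include dir, pointing at the CUDA toolkit root
TF_INC=$(python -c "import tensorflow as tf; print(tf.sysconfig.get_include())")
ln -s /usr/local/cuda "$TF_INC/cuda"
# after this, "cuda/include/cuda.h" resolves to /usr/local/cuda/include/cuda.h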
PS: here is the actual (last) Makefile I used in any case (rename it back to Makefile)
Makefile.txt

Cheers,

Ferran.


seni04 commented Aug 12, 2019

Hello @seni04, the technical staff told me they had fixed it by creating a soft link between the actual CUDA version on the machine and the "standard" path where it is normally installed.

I assume they did something like:

ln -s /usr/bin/cuda-10.0 /usr/bin/cuda
But using the actual path where you installed CUDA as the first argument.
I'm sorry I cannot give you more details but I've just checked my IT tickets and found no extra details.
I hope this helps you,
PS: here is the actual (last) Makefile I used in any case (rename it back to Makefile)
Makefile.txt

Cheers,

Ferran.

nvcc -c --expt-relaxed-constexpr -g -std=c++11 -DNDEBUG -I/usr/local/lib/python2.7/dist-packages/tensorflow/include -I"/usr/local/cuda-9.0/include" -DGOOGLE_CUDA=1 -D_MWAITXINTRIN_H_INCLUDED -D_FORCE_INLINES -D__STRICT_ANSI__ -D_GLIBCXX_USE_CXX11_ABI=0 src/ops/preprocessing/kernels/data_augmentation.cu.cc -x cu -Xcompiler -fPIC -o src/ops/build/data_augmentation.o
In file included from /usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h:21:0,
from src/ops/preprocessing/kernels/data_augmentation.cu.cc:7:
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/util/cuda_device_functions.h:32:31: fatal error: cuda/include/cuda.h: No such file or directory
compilation terminated.
Makefile:68: recipe for target 'preprocessing' failed
make: *** [preprocessing] Error 1

I am still getting this error, even though I am already using the same Makefile as yours.

@fperezgamonal (Author)

Hello again,

I am very sorry to see you are still facing the same issues. I totally understand your frustration, since I was unable to compile the ops on another computer to run more experiments in parallel (even though I had the same configuration and Makefile!).

The only thing I can think of is searching for this error, since it is quite common, and trying some of the proposed solutions to see if any of them works.
By the way, if you happen to solve this issue and then run into a missing library (libcupti), I solved that by adding the path to the library to the LD_LIBRARY_PATH environment variable, as follows:
export LD_LIBRARY_PATH=/soft/easybuild/debian/8.8/Broadwell/software/CUDA/9.0.176/extras/CUPTI/lib64:$LD_LIBRARY_PATH

If I can find any more information on how to solve your error, I will post it here.
I wish you luck!

PS: I'll leave this open so more people can see this issue and hopefully provide a solution.
Cheers,
Ferran.

@Vedant2311

Hi.

I am facing a similar issue as well. I am trying to run a pre-trained styleGAN model (https://github.com/NVlabs/stylegan2) on my JupyterLab in a Tensorflow 1.14 GPU environment.

So, when I try to run the python code python run_generator.py generate-images --network=gdrive:networks/stylegan2-ffhq-config-f.pkl --seeds=6600-6625 --truncation-psi=0.5 as given in the link, I get the following error:

tensorflow.python.framework.errors_impl.NotFoundError: /trainman-mount/trainman-storage-d2b580e4-067b-44d3-9be3-be48cc5f0d71/stylegan2/dnnlib/tflib/_cudacache/fused_bias_act_1ac15fee5b354fc0d3aa1e7f98502e64.so: undefined symbol: _ZN10tensorflow12OpDefBuilder6OutputESs

I have no idea what this _ZN10tensorflow12OpDefBuilder6OutputESs means, but it seems similar to the one raised in this thread. I also tried to find solutions for this error, but all of them revolve around modifying some Makefile, and there does not seem to be a Makefile involved in my problem since I am just running Python code.

Any help will be much appreciated :)

@stefanuddenberg

I am facing the same issue while trying to get this to work on my university's cluster. I was able to get it working fine on my Windows machine, and my group has been able to get it to work on an EC2 instance, so I have no idea what the issue is exactly. From what I can tell, all the correct dependencies are installed... @Vedant2311 did you come up with a solution?


ahmedshingaly commented Jan 4, 2021

Hi.

I am facing a similar issue as well. I am trying to run a pre-trained styleGAN model (https://github.com/NVlabs/stylegan2) on my JupyterLab in a Tensorflow 1.14 GPU environment.

So, when I try to run the python code python run_generator.py generate-images --network=gdrive:networks/stylegan2-ffhq-config-f.pkl --seeds=6600-6625 --truncation-psi=0.5 as given in the link, I get the following error:

tensorflow.python.framework.errors_impl.NotFoundError: /trainman-mount/trainman-storage-d2b580e4-067b-44d3-9be3-be48cc5f0d71/stylegan2/dnnlib/tflib/_cudacache/fused_bias_act_1ac15fee5b354fc0d3aa1e7f98502e64.so: undefined symbol: _ZN10tensorflow12OpDefBuilder6OutputESs

I have no idea what this _ZN10tensorflow12OpDefBuilder6OutputESs means, but it seems similar to the one raised in this thread. I also tried to find solutions for this error, but all of them revolve around modifying some Makefile, and there does not seem to be a Makefile involved in my problem since I am just running Python code.

Any help will be much appreciated :)

In file stylegan2/dnnlib/tflib/custom_ops.py, line 127, change
compile_opts += ' --compiler-options \'-fPIC -D_GLIBCXX_USE_CXX11_ABI=0\''
to
compile_opts += ' --compiler-options \'-fPIC -D_GLIBCXX_USE_CXX11_ABI=1\''
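A brief note on why this helps (my interpretation, not from the StyleGAN2 authors): the missing symbol demangles to an old-ABI std::string signature,

echo '_ZN10tensorflow12OpDefBuilder6OutputESs' | c++filt
# tensorflow::OpDefBuilder::Output(std::string)   (Ss is the mangling of the pre-C++11-ABI std::string)

so the .so was compiled with _GLIBCXX_USE_CXX11_ABI=0 while the installed TensorFlow wheel uses the new ABI; setting the flag to 1 makes the two match. If in doubt, tf.sysconfig.get_compile_flags() prints the ABI value your wheel was built with.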

@AliRashidnejad

Hi.
I am facing a similar issue as well. I am trying to run a pre-trained styleGAN model (https://github.com/NVlabs/stylegan2) on my JupyterLab in a Tensorflow 1.14 GPU environment.
So, when I try to run the python code python run_generator.py generate-images --network=gdrive:networks/stylegan2-ffhq-config-f.pkl --seeds=6600-6625 --truncation-psi=0.5 as given in the link, I get the following error:

tensorflow.python.framework.errors_impl.NotFoundError: /trainman-mount/trainman-storage-d2b580e4-067b-44d3-9be3-be48cc5f0d71/stylegan2/dnnlib/tflib/_cudacache/fused_bias_act_1ac15fee5b354fc0d3aa1e7f98502e64.so: undefined symbol: _ZN10tensorflow12OpDefBuilder6OutputESs

I have no idea what this _ZN10tensorflow12OpDefBuilder6OutputESs means, but it seems similar to the one raised in this thread. I also tried to find solutions for this error, but all of them revolve around modifying some Makefile, and there does not seem to be a Makefile involved in my problem since I am just running Python code.
Any help will be much appreciated :)

In file stylegan2/dnnlib/tflib/custom_ops.py, line 127, change
compile_opts += ' --compiler-options \'-fPIC -D_GLIBCXX_USE_CXX11_ABI=0\''
to
compile_opts += ' --compiler-options \'-fPIC -D_GLIBCXX_USE_CXX11_ABI=1\''

Thanks @ahmedshingaly, this solved a similar issue for me.

@justusgraham

This also solved the issue for me. It would have been impossible to debug otherwise; thank you!
