This is a Torch implementation of Image Captioning using Multi-modal RNN that use both Word Embeddings and CNN features, as described in Mao et. al.
The implementation is incomplete and work in progress
Meanwhile check out the Tensorflow Implemetaion by J.Mao here