## Information

- name: LSTM image captioning model, based on the CVPR 2015 paper "[Show and tell: A neural image caption generator](http://arxiv.org/abs/1411.4555)" and code from Karpathy's [NeuralTalk](https://github.com/karpathy/neuraltalk)
- model_file: https://s3-us-west-1.amazonaws.com/nervana-modelzoo/image_caption_flickr8k.py
- model_weights: https://s3-us-west-1.amazonaws.com/nervana-modelzoo/image_caption_flickr8k.p
- neon_version: v1.0.rc1
- neon_commit: 2169b093fbba0c189021a941d286c7a98c0c6c6c
- gist_id: 7e76e90664f935c6f65d

## Description

The LSTM model is trained on the [flickr8k dataset](http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html) using precomputed VGG features from http://cs.stanford.edu/people/karpathy/deepimagesent/. Model details can be found in the following [CVPR 2015 paper](http://arxiv.org/abs/1411.4555):

Show and tell: A neural image caption generator. O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. CVPR, 2015 (arXiv:1411.4555).

The model was trained for 15 epochs, where one epoch is one pass over all 5 captions of each image. Training data was shuffled each epoch.

To evaluate on the test set, download the model file and weights and run:

`python image_caption.py --model_file [path_to_weights]`

## Performance

At test time the model is given only the image and must predict one word at a time until a stop token is predicted. Decoding currently uses greedy search, taking the most probable word at each step (a minimal sketch of this decoding loop is given below the results).

BLEU scores were computed with the evaluation script from https://raw.githubusercontent.com/karpathy/neuraltalk/master/eval/, scoring each generated caption against the 5 reference sentences per image:

| BLEU | Score |
| ---- | ----- |
| B-1  | 54.2  |
| B-2  | 32.6  |
| B-3  | 19.3  |
| B-4  | 12.3  |

Beam search, L2 regularization, and ensembling were not implemented; adding them would likely improve these scores somewhat.
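As an illustration of the greedy decoding described above, here is a minimal, hypothetical sketch. `lstm_step`, `vocab`, and the start/stop token ids are placeholders standing in for the trained neon model and its vocabulary; they are not part of the released code.

```python
import numpy as np

def greedy_decode(lstm_step, image_feat, start_id, stop_id, vocab, max_len=20):
    """Generate a caption by always taking the most probable next word.

    `lstm_step(image_feat, prev_word_id, state)` is assumed to return a
    probability distribution over the vocabulary plus the updated recurrent
    state; it stands in for one forward step of the trained LSTM.
    """
    state = None
    word_id = start_id
    caption = []
    for _ in range(max_len):
        probs, state = lstm_step(image_feat, word_id, state)
        word_id = int(np.argmax(probs))   # greedy choice: max-probability word
        if word_id == stop_id:            # stop token ends the caption
            break
        caption.append(vocab[word_id])
    return " ".join(caption)
```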
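The table above was produced with the NeuralTalk eval script linked earlier, not with the snippet below. This is only a sketch of the multi-reference BLEU setup, using NLTK's `corpus_bleu` as a stand-in for the original tooling; the captions shown are made up for illustration.

```python
from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis caption is scored against all reference captions of its
# image (only two invented references are shown here; flickr8k has 5 per image).
references = [[
    "a dog runs through the grass".split(),
    "a brown dog is running outside".split(),
]]
hypotheses = ["a dog running in the grass".split()]

# Cumulative BLEU-1 through BLEU-4, analogous to the B-1..B-4 rows above.
# Real evaluation aggregates over the full test set, not a single sentence.
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    print("B-%d: %.3f" % (n, corpus_bleu(references, hypotheses, weights=weights)))
```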