<h1>A Brief Summary on Neural Style Transfer</h1>
<p>Xueting Li, 2017-05-19</p>
<figure>
<img src="/assets/posts/2017-05-19-a-brief-summary-on-neural-style-transfer/Development.png" height="200" />
<figcaption>Figure 1 A Brief Sketch about the Development of Neural Style Transfer in Recent Years.</figcaption>
</figure>
<p>Neural Style Transfer is a fascinating yet mysterious area in computer vision. The opening paper by <a href="https://arxiv.org/pdf/1508.06576.pdf">Leon A. Gatys</a> et al. was published in 2015. Since then, much progress has been made in this area, bringing more and more fascinating features as well as insight into how and why it works. In this blog post, I would like to summarize several papers on this topic and sketch the developments of recent years, as shown in Figure 1.</p>
<h2 id="opening-paper">Opening Paper</h2>
<figure>
<img src="/assets/posts/2017-05-19-a-brief-summary-on-neural-style-transfer/1.png" />
<figcaption>Figure 2 Artistic Style Transfer. From left to right: content image, style image, transferred result.</figcaption>
</figure>
<p>The opening papers, <a href="https://arxiv.org/pdf/1508.06576.pdf">A Neural Algorithm of Artistic Style</a> and <a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf">Image Style Transfer Using Convolutional Neural Networks</a>, came out in 2015 and 2016 respectively. Thanks to the dramatic development of CNNs (Convolutional Neural Networks), Leon A. Gatys et al. were able to capture image content and style separately using a CNN pre-trained for object recognition, and to transfer style from one image to another.</p>
<p><strong>Task.</strong> In neural style transfer, one provides two images, a content image (left in Figure 2) and a style image (middle in Figure 2). The style transfer algorithm then combines the style of the style image with the content of the content image to generate a new image (right in Figure 2). This requires capturing image content and style separately.</p>
<p><strong>Capturing Image Content.</strong> CNNs are known as brilliant feature extractors. They capture low-level pixel information in shallow layers and high-level semantic information in deep layers. So the feature maps of a CNN are sufficient to represent image content, and we define the content loss as the squared-error distance between feature maps:</p>
<script type="math/tex; mode=display">L_{content}(\vec{p},\vec{x},l) = \frac{1}{2}\sum_{i,j}(F_{ij}^l - P_{ij}^l)^2</script>
<p>Here <script type="math/tex">\vec{p}</script> and <script type="math/tex">\vec{x}</script> are the original image and the generated image, and <script type="math/tex">P^l</script> and <script type="math/tex">F^l</script> are their respective feature maps in layer <script type="math/tex">l</script> of the pre-trained VGG net.</p>
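<p>As a minimal sketch of this loss, the snippet below computes half the sum of squared differences between two feature maps. Plain Python lists are used so it runs without dependencies; the function name <code class="highlighter-rouge">content_loss</code> and the toy numbers are my own assumptions, not something taken from the paper's code.</p>

```python
def content_loss(F, P):
    """Half the sum of squared differences between two feature maps.

    F and P are lists of rows (one row per filter, one column per position),
    standing in for the layer-l features of the generated and original image.
    """
    return 0.5 * sum(
        (f - p) ** 2
        for row_f, row_p in zip(F, P)
        for f, p in zip(row_f, row_p)
    )


# Toy 2-filter x 3-position feature maps (made-up numbers).
F = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
P = [[1.0, 0.0, 3.0], [0.0, 1.0, 2.0]]
loss = content_loss(F, P)
print(loss)  # prints 4.0
```

<p>In a real implementation the same expression would be evaluated on tensors produced by the VGG net, so that gradients can flow back to the generated image.</p>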
<p><strong>Capturing Image Style.</strong> Image style is an ill-defined notion; it is hard even to give a precise verbal definition. An intuition is that the style of a painting (or a painter) should be independent of content as well as location, so one might use a statistic computed over the whole feature map to represent the style of an image. This leads to the Gram matrix, defined as:</p>
<script type="math/tex; mode=display">G_{ij}^l = \sum_{k}F_{ik}^lF_{jk}^l</script>
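<p>Given a <script type="math/tex">C \times H \times W</script> feature map, its Gram matrix is a <script type="math/tex">C \times C</script> matrix, with <script type="math/tex">G_{ij}</script> computed as the element-wise product of channel <script type="math/tex">i</script> and channel <script type="math/tex">j</script> summed over all spatial positions. So it can be interpreted as the (unnormalized) correlation between feature maps <script type="math/tex">F_i</script> and <script type="math/tex">F_j</script>.</p>
<p>With <script type="math/tex">G^l</script> and <script type="math/tex">A^l</script> the Gram matrices of the generated image and the style image in layer <script type="math/tex">l</script>, <script type="math/tex">N_l</script> the number of filters in that layer and <script type="math/tex">M_l</script> the size (height times width) of each feature map, we define the style loss as a weighted sum of the mean squared errors between the Gram matrices computed at layers <script type="math/tex">l_1, l_2, ..., l_n</script>:</p>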
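<p>A small sketch of the Gram matrix computation may help here. It flattens each channel and takes inner products between every pair of channels. The helper name <code class="highlighter-rouge">gram_matrix</code> and the toy feature map are assumptions made for illustration only.</p>

```python
def gram_matrix(feature_map):
    """Gram matrix of a C x H x W feature map given as nested lists."""
    # Flatten each channel into a single vector of length H*W.
    flat = [[v for row in channel for v in row] for channel in feature_map]
    # Entry (i, j) is the inner product of flattened channels i and j.
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in flat] for fi in flat]


# Toy feature map with C=2 channels of size 2x2 (made-up numbers).
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
G = gram_matrix(features)
print(G)  # prints [[30.0, 6.0], [6.0, 2.0]]
```

<p>Note that the result does not depend on the order of the spatial positions, which matches the intuition that style should be independent of location.</p>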
<script type="math/tex; mode=display">E_l = \frac{1}{4N_l^2M_l^2}\sum_{i,j}(G_{ij}^l - A_{ij}^l)^2</script>
<script type="math/tex; mode=display">L_{style}(\vec{a},\vec{x}) = \sum_{l=0}^{L}w_lE_l</script>
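<p>The two formulas above can be sketched as follows. This is only an illustration under my own assumptions: the Gram matrices are passed in as nested lists, and the names <code class="highlighter-rouge">layer_style_error</code> and <code class="highlighter-rouge">style_loss</code> are hypothetical.</p>

```python
def layer_style_error(G, A, num_filters, map_size):
    """Per-layer error: squared Gram difference scaled by 1 / (4 N^2 M^2)."""
    squared_diff = sum(
        (g - a) ** 2 for row_g, row_a in zip(G, A) for g, a in zip(row_g, row_a)
    )
    return squared_diff / (4.0 * num_filters ** 2 * map_size ** 2)


def style_loss(layers, weights):
    """Weighted sum of per-layer errors; each layer is (G, A, N, M)."""
    return sum(w * layer_style_error(*layer) for layer, w in zip(layers, weights))


# Two toy layers with 2 filters each (made-up Gram matrices).
layer1 = ([[4.0, 0.0], [0.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]], 2, 2)
layer2 = ([[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]], 2, 4)
total = style_loss([layer1, layer2], weights=[0.5, 0.5])
print(total)  # prints 0.25
```

<p>The second layer contributes nothing because its two Gram matrices already match.</p>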
<p><strong>Training.</strong> The first Neural Style Transfer algorithm is formulated as an optimization problem. Starting with a noise image, one feeds this image to the pre-trained VGG net and calculates the style loss and the content loss between this image and the style image and content image respectively, then back-propagates the gradient all the way to the noise image to push it closer to the target image. After repeating this optimization step for many iterations, we get an image that combines the content of the content image with the style of the style image. An example result is shown in Figure 2.</p>
<p><strong>Drawbacks.</strong> This opening paper shows a fantastic application of CNNs. Yet it tackles the problem with an iterative optimization process, so it is inefficient and hard to use in real-world settings such as a phone app. Also, why the Gram matrix can capture the style of an image remains a mystery.</p>
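<p>To make the idea of optimizing the image itself concrete, here is a deliberately simplified sketch. It assumes the network is the identity function, so the loss gradient can be written by hand; a real implementation would get the gradient from back-propagation through VGG. The function name and hyperparameters are made up for this example.</p>

```python
import random


def optimize_image(target, steps=200, lr=0.1, seed=0):
    """Gradient descent on the pixels of a noise image.

    Assumption: the "network" is the identity, so the loss is
    0.5 * sum((x - target)^2) and its gradient with respect to x is x - target.
    """
    rng = random.Random(seed)
    x = [rng.random() for _ in target]  # start from a noise image
    for _ in range(steps):
        grad = [xi - ti for xi, ti in zip(x, target)]  # gradient w.r.t. pixels
        x = [xi - lr * gi for xi, gi in zip(x, grad)]  # update the image itself
    return x


target = [0.2, 0.8, 0.5, 0.1]
result = optimize_image(target)
final_loss = 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(result, target))
print(final_loss)
```

<p>The point of the sketch is that no network weights change: only the pixels of the image are updated at every step.</p>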
<h2 id="feed-forward-style-transfer">Feed-Forward Style Transfer</h2>
<figure>
<img src="/assets/posts/2017-05-19-a-brief-summary-on-neural-style-transfer/FeedForward.png" />
<figcaption>Figure 3 Network Architecture of Feed-Forward Style Transfer</figcaption>
</figure>
<p>One issue with neural style transfer is that it needs many optimization iterations to generate a transferred image. <a href="https://arxiv.org/pdf/1603.08155.pdf">Johnson et al</a>. tackled this problem by training a network to perform the style transfer task. Their network architecture is shown in Figure 3. Using the same style and content losses as the opening paper, their transformer network receives a content image <script type="math/tex">x</script> and generates a target image <script type="math/tex">\hat{y}</script>. Then <script type="math/tex">\hat{y}</script>, the style image <script type="math/tex">y_s</script> and the content image <script type="math/tex">y_c</script> are fed into the pre-trained VGG net to compute the style and content losses. Gradients from these two losses are back-propagated through the generated image into the transformer network and used to train it. At test time, one only needs to feed the content image into the transformer network to get a transferred image. Thus the transfer task is completed in one feed-forward pass rather than many forward and backward iterations.</p>
<h2 id="demystify-neural-style-transfer">Demystify Neural Style Transfer</h2>
<p>Why does the Gram matrix work? <a href="https://arxiv.org/pdf/1701.01036.pdf">Li et al.</a> proved that minimizing the style loss based on the Gram matrix is equivalent to minimizing the Maximum Mean Discrepancy with the second-order polynomial kernel. Thus the essence of neural style transfer is to match the feature distributions of the style image and the generated image. Besides the second-order polynomial kernel, the Maximum Mean Discrepancy can also be minimized with other kernels, such as the linear kernel and the Gaussian kernel.</p>
<p>Additionally, they proposed a style loss that aligns the BN statistics (mean and standard deviation) of the feature maps of the style image and the generated image:</p>
<script type="math/tex; mode=display">L_{style}^l = \frac{1}{N_l}\sum_{i=1}^{N_l}\left((\mu_{F^l}^i - \mu_{S^l}^i)^2 + (\sigma_{F^l}^i - \sigma_{S^l}^i)^2\right)</script>
<p>where <script type="math/tex">\mu_{F^l}^i</script> and <script type="math/tex">\sigma_{F^l}^i</script> are the mean and standard deviation of the <script type="math/tex">i^{th}</script> feature channel over all positions of the feature map in layer <script type="math/tex">l</script> for the generated image, while <script type="math/tex">\mu_{S^l}^i</script> and <script type="math/tex">\sigma_{S^l}^i</script> are the corresponding statistics of the style image. An image generated by minimizing the BN-statistics-based loss is shown in Figure 4. Its quality is comparable to that of the Gram-matrix-based loss.</p>
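<p>The equivalence can be checked numerically on a toy example. The sketch below compares the squared distance between two Gram matrices with the kernel form of the (unnormalized) squared Maximum Mean Discrepancy under the kernel k(u, v) = (u · v)². The feature values are made up, both images are assumed to have the same number of positions, and normalization constants are left out.</p>

```python
def gram(X):
    """Gram matrix over channels; X[k] is the C-dim feature at position k."""
    C = len(X[0])
    return [[sum(x[i] * x[j] for x in X) for j in range(C)] for i in range(C)]


def poly_kernel(u, v):
    """Second-order polynomial kernel k(u, v) = (u . v)^2."""
    return sum(a * b for a, b in zip(u, v)) ** 2


def kernel_sum(X, Y):
    """Sum of kernel values over all pairs of positions."""
    return sum(poly_kernel(x, y) for x in X for y in Y)


# Feature columns (one C=2 vector per position) for two toy images.
F = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
S = [[1.0, 1.0], [0.0, 2.0], [1.0, 0.0]]

GF, GS = gram(F), gram(S)
gram_distance = sum((GF[i][j] - GS[i][j]) ** 2 for i in range(2) for j in range(2))
mmd_like = kernel_sum(F, F) + kernel_sum(S, S) - 2 * kernel_sum(F, S)
print(gram_distance, mmd_like)  # prints 36.0 36.0
```

<p>Both quantities agree, which is the identity behind the proof: each Gram matrix entry sums over positions, so the squared Gram distance expands into sums of squared inner products between feature vectors.</p>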
<figure>
<img src="/assets/posts/2017-05-19-a-brief-summary-on-neural-style-transfer/BN.png" />
<figcaption>Figure 4 Image generated by aligning the BN statistics.</figcaption>
</figure>
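<p>A minimal sketch of the BN-statistics loss for a single layer is given below. It adds the squared mean difference and the squared standard deviation difference per channel and averages over channels. The function name and data layout (a flat list of activations per channel) are assumptions for illustration.</p>

```python
from statistics import fmean, pstdev


def bn_statistics_loss(generated, style):
    """Mean/std matching loss for one layer.

    Each argument is a list of channels; a channel is a flat list of
    activations over all spatial positions.
    """
    total = 0.0
    for chan_f, chan_s in zip(generated, style):
        mean_term = (fmean(chan_f) - fmean(chan_s)) ** 2
        std_term = (pstdev(chan_f) - pstdev(chan_s)) ** 2
        total += mean_term + std_term
    return total / len(generated)


# Two channels with two positions each (made-up numbers).
generated = [[1.0, 3.0], [2.0, 2.0]]
style = [[0.0, 4.0], [5.0, 5.0]]
loss = bn_statistics_loss(generated, style)
print(loss)
```

<p>In this toy case the first channel differs only in standard deviation and the second only in mean, so both terms of the loss are exercised.</p>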
<h2 id="multi-style-transfer">Multi-style Transfer</h2>
<p>Although fast neural style transfer speeds up Neural Style Transfer, one has to train a separate transformer network for each style. <a href="https://arxiv.org/pdf/1610.07629.pdf">Dumoulin et al.</a> observed that the parameters <script type="math/tex">\gamma</script> and <script type="math/tex">\beta</script> of the <a href="https://arxiv.org/abs/1607.08022">Instance Normalization</a> layer can scale and shift an activation <script type="math/tex">z</script> in a way specific to painting style <script type="math/tex">s</script>. Conventional Instance Normalization can be written as:</p>
<script type="math/tex; mode=display">z = {\gamma}(\frac{x - \mu}{\sigma}) + {\beta}</script>
<p>where <script type="math/tex">x</script> is the feature map of the input content image from layer l, and <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> are <script type="math/tex">x</script>’s mean and standard deviation taken across the spatial axes. <script type="math/tex">\gamma</script> and <script type="math/tex">\beta</script> are learned parameters that scale and shift this feature map.</p>
<p>In the proposed method, called Conditional Instance Normalization, <script type="math/tex">\gamma</script> and <script type="math/tex">\beta</script> are expanded to <script type="math/tex">N \times C</script> matrices, where <script type="math/tex">N</script> is the number of styles and <script type="math/tex">C</script> is the number of channels. To normalize <script type="math/tex">x</script> to a specific style <script type="math/tex">s</script>, one simply computes:</p>
<script type="math/tex; mode=display">z = {\gamma}_s(\frac{x - \mu}{\sigma}) + {\beta}_s</script>
<p>where <script type="math/tex">{\gamma}_s</script> and <script type="math/tex">{\beta}_s</script> are the <script type="math/tex">s^{th}</script> rows of <script type="math/tex">\gamma</script> and <script type="math/tex">\beta</script>, specialized to represent style <script type="math/tex">s</script>. The process is illustrated in Figure 5:</p>
<figure>
<img src="/assets/posts/2017-05-19-a-brief-summary-on-neural-style-transfer/CIN.png" />
<figcaption>Figure 5 Conditional Instance Normalization.</figcaption>
</figure>
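<p>Conditional Instance Normalization can be sketched in a few lines. The version below works on flat per-channel activation lists and picks row <code class="highlighter-rouge">style</code> of the parameter tables. The small <code class="highlighter-rouge">eps</code> added to the standard deviation is my own assumption to avoid division by zero; the function name and all numbers are hypothetical.</p>

```python
from statistics import fmean, pstdev


def conditional_instance_norm(x, gamma, beta, style, eps=1e-5):
    """Normalize each channel of x, then scale/shift with row `style`.

    x: list of C channels, each a flat list of spatial activations.
    gamma, beta: N x C tables, one row of per-channel parameters per style.
    eps: assumed stabilizer, not part of the formula in the text.
    """
    out = []
    for c, channel in enumerate(x):
        mu, sigma = fmean(channel), pstdev(channel)
        out.append([
            gamma[style][c] * (v - mu) / (sigma + eps) + beta[style][c]
            for v in channel
        ])
    return out


x = [[1.0, 3.0], [2.0, 6.0]]       # C=2 channels, 2 positions each
gamma = [[1.0, 1.0], [2.0, 0.5]]   # N=2 styles
beta = [[0.0, 0.0], [10.0, -1.0]]
styled = conditional_instance_norm(x, gamma, beta, style=1)
print(styled)
```

<p>Switching styles only changes which row of <code class="highlighter-rouge">gamma</code> and <code class="highlighter-rouge">beta</code> is used; everything else in the network is shared.</p>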
<p>This paper confirms the observation from <a href="#demystify-neural-style-transfer">Demystify Neural Style Transfer</a> that the Instance Normalization statistics (Instance Normalization behaves like Batch Normalization with a batch size of 1) can indeed represent a style.</p>
<p>Another interesting paper is by <a href="https://arxiv.org/abs/1703.09210">Chen et al.</a>. They trained a feed-forward style transfer network on multiple styles by using StyleBank. A style bank is composed of multiple convolutional filter banks, each of which corresponds explicitly to one style. To transfer an image into a specific style, the corresponding filter bank is applied on top of the intermediate feature embedding produced by a single auto-encoder. Their network architecture is shown in Figure 6:</p>
<figure>
<img src="/assets/posts/2017-05-19-a-brief-summary-on-neural-style-transfer/styleBank.png" />
<figcaption>Figure 6 Network Architecture of Styles Bank.</figcaption>
</figure>
<p>An image <script type="math/tex">I</script> is fed into the encoder <script type="math/tex">\varepsilon</script> of the auto-encoder to get the feature map <script type="math/tex">F</script>. <script type="math/tex">F</script> is then convolved with each filter bank <script type="math/tex">K_i</script> to get new feature maps <script type="math/tex">\tilde{F_i}</script> specific to style <script type="math/tex">i</script>. Those feature maps are then fed into the decoder <script type="math/tex">D</script> to get the transferred images. To train the auto-encoder as well as the style bank, two losses are used. One is the perceptual loss, the same as in <a href="#feed-forward-style-transfer">Feed-Forward Style Transfer</a>; the other is the MSE loss of image reconstruction. This significantly reduces the parameters needed for each style, from a whole transformer network to just a convolutional filter bank. It also supports incremental style addition by fixing the auto-encoder and training a new filter bank specific to the new style from scratch.</p>
<h2 id="arbitrary-style-transfer">Arbitrary Style Transfer</h2>
<p>A further step beyond multi-style transfer is arbitrary style transfer. Although conditional instance normalization enables a transformer network to learn multiple styles, the network’s capacity has to grow with the number of styles it captures. The ultimate goal of style transfer is that, given a content image and a style image, one can get the transferred image in a single feed-forward pass, without a transformer network pre-trained on this style. <a href="https://arxiv.org/abs/1703.06868">Huang and Belongie</a> proposed Adaptive Instance Normalization to achieve this. What we know from <a href="#multi-style-transfer">Multi-style Transfer</a> is that the <script type="math/tex">\gamma</script> and <script type="math/tex">\beta</script> in the Instance Normalization layer can scale and shift an activation <script type="math/tex">z</script> in a way specific to painting style <script type="math/tex">s</script>. So why not directly scale and shift the content image’s feature map to align it with the statistics of the style image? Adaptive Instance Normalization does exactly this:</p>
<script type="math/tex; mode=display">AdaIN(x,y) = \sigma(y)(\frac{x-\mu(x)}{\sigma(x)}) + \mu(y)</script>
<p>where <script type="math/tex">x</script> is the feature map of the input content image and <script type="math/tex">y</script> is the feature map of the style image. <script type="math/tex">\mu(x)</script>, <script type="math/tex">\mu(y)</script> and <script type="math/tex">\sigma(x)</script>, <script type="math/tex">\sigma(y)</script> are their means and standard deviations, computed per channel across spatial positions.</p>
<p>Their network architecture is slightly different from those of feed-forward style transfer and multi-style transfer. The encoder consists of the first few layers of a fixed VGG-19 network. An AdaIN layer performs style transfer in the feature space. A decoder is learned to invert the AdaIN output back to image space.</p>
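<p>As a rough sketch, AdaIN on per-channel activation lists looks like this. The <code class="highlighter-rouge">eps</code> term is an assumption added for numerical safety, and the function name and toy values are mine rather than the authors'.</p>

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization on per-channel flat activation lists.

    eps is an assumed stabilizer, not part of the formula in the text.
    """
    out = []
    for x, y in zip(content, style):
        mu_x, sigma_x = fmean(x), pstdev(x)
        mu_y, sigma_y = fmean(y), pstdev(y)
        # Remove the content statistics, then impose the style statistics.
        out.append([sigma_y * (v - mu_x) / (sigma_x + eps) + mu_y for v in x])
    return out


content = [[0.0, 2.0, 4.0]]    # one channel, three positions
style = [[10.0, 10.0, 40.0]]
out = adain(content, style)
print(fmean(out[0]), pstdev(out[0]))
```

<p>After the operation the channel keeps the spatial pattern of the content features but carries the mean and standard deviation of the style features. Unlike Conditional Instance Normalization, there are no learned scale and shift parameters here.</p>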
<figure>
<img src="/assets/posts/2017-05-19-a-brief-summary-on-neural-style-transfer/AdaIN.png" />
<figcaption>Figure 7 Network Architecture of Arbitrary Style Transfer.</figcaption>
</figure>
<h2 id="controllable-style-transfer">Controllable Style Transfer</h2>
<p>The last line of style transfer work I would like to go through is style transfer that preserves color or controls other factors such as brush stroke. I will mainly talk about color preservation; details about other controllable factors can be found in <a href="https://arxiv.org/abs/1611.07865">Controlling Perceptual Factors in Neural Style Transfer</a>. Color-preserving style transfer means preserving the colors of the content image while transferring the texture style from the style image to the content image. There are two approaches proposed by <a href="https://arxiv.org/abs/1606.05897">Gatys et al.</a>. The first one is to take the style image <script type="math/tex">S</script> and use a color histogram matching algorithm to generate a new style image <script type="math/tex">S'</script> whose color histogram matches that of the content image, then use <script type="math/tex">S'</script> as the style image and perform the conventional neural style transfer algorithm with content image <script type="math/tex">C</script>. The other approach is to perform style transfer only in the luminance channel, then combine the generated luminance channel with the I and Q channels (of the YIQ color space) from the content image. Results are shown in Figure 8. From my own experience, luminance-only style transfer performs better than the histogram matching approach, since the result of the latter depends on how well the histogram matching algorithm works.</p>
<figure>
<img src="/assets/posts/2017-05-19-a-brief-summary-on-neural-style-transfer/colorPreserving.png" />
<figcaption>Figure 8 Results of Color Preserving Style Transfer.</figcaption>
</figure>
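<p>The recombination step of the luminance-only approach can be sketched per pixel with Python's standard <code class="highlighter-rouge">colorsys</code> module. This only illustrates how the stylized luminance is merged with the content image's I and Q channels; the style transfer itself is not shown, and the helper name <code class="highlighter-rouge">keep_content_color</code> is hypothetical.</p>

```python
import colorsys


def keep_content_color(stylized_rgb, content_rgb):
    """Combine the luminance of a stylized pixel with the content pixel's color.

    Both pixels are (r, g, b) tuples with components in [0, 1].
    """
    y_stylized, _, _ = colorsys.rgb_to_yiq(*stylized_rgb)
    _, i_content, q_content = colorsys.rgb_to_yiq(*content_rgb)
    return colorsys.yiq_to_rgb(y_stylized, i_content, q_content)


stylized_pixel = (0.5, 0.5, 0.5)   # made-up stylized value
content_pixel = (0.6, 0.4, 0.3)    # made-up content value
merged = keep_content_color(stylized_pixel, content_pixel)
print(merged)
```

<p>Applied to every pixel, this keeps the brightness structure produced by the style transfer while the hue information comes entirely from the content image. Note that <code class="highlighter-rouge">colorsys.yiq_to_rgb</code> clamps its result to the range [0, 1].</p>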
<h2 id="references">References</h2>
<ol>
<li>Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. “A neural algorithm of artistic style.” arXiv preprint arXiv:1508.06576 (2015).</li>
<li>Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. “Perceptual losses for real-time style transfer and super-resolution.” arXiv preprint arXiv:1603.08155 (2016).</li>
<li>Li, Yanghao, et al. “Demystifying Neural Style Transfer.” arXiv preprint arXiv:1701.01036 (2017).</li>
<li>Gatys, Leon A., et al. “Preserving color in neural artistic style transfer.” arXiv preprint arXiv:1606.05897 (2016).</li>
<li>Gatys, Leon A., et al. “Controlling Perceptual Factors in Neural Style Transfer.” arXiv preprint arXiv:1611.07865 (2016).</li>
<li>Dumoulin, Vincent, Jonathon Shlens, and Manjunath Kudlur. “A learned representation for artistic style.” (2017).</li>
<li>Huang, Xun, and Serge Belongie. “Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization.” arXiv preprint arXiv:1703.06868 (2017).</li>
<li>Chen, Dongdong, et al. “Stylebank: An explicit representation for neural image style transfer.” arXiv preprint arXiv:1703.09210 (2017).</li>
</ol>
<h1>How to Train Fast RCNN on ImageNet</h1>
<p>Xueting Li, 2015-12-19</p>
<p>I’ve been playing with <a href="https://github.com/rbgirshick/fast-rcnn">fast-rcnn</a> for a while. This wonderful project has helped me understand more about deep learning and its power. However, it only ships with pre-trained fast rcnn models for pascal voc, which has 20 classes. To use the project in real applications, I need to train a model on the <a href="http://www.image-net.org/">ImageNet</a> detection dataset (to save time, I chose only two of its 200 classes). This post records what needs to be done to train a fast rcnn model on ImageNet.</p>
<h2 id="prepare-dataset">Prepare Dataset</h2>
<p>The organization of my dataset is like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>imagenet
|-- data
    |-- train.mat
    |-- Annotations
        |-- *.xml (Annotation files)
    |-- Images
        |-- *.JPEG (Image files)
    |-- ImageSets
        |-- train.txt
</code></pre>
</div>
<ul>
<li>train.mat: This is the selective search proposals file</li>
<li>Annotations: This folder contains all annotation files of the images</li>
<li>Images: This folder contains all images</li>
<li>ImageSets: This folder contains only one file, train.txt, which lists the names of all the images. It looks like this:</li>
</ul>
<div class="highlighter-rouge"><pre class="highlight"><code>n02769748_18871
n02769748_2379
n02958343_4294
...
</code></pre>
</div>
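<p>If you need to build such a file yourself, something along the following lines should work. It is only a sketch: the function names are hypothetical, and it assumes every image in the folder ends with <code class="highlighter-rouge">.JPEG</code> and should be part of the training set.</p>

```python
import os


def list_image_names(image_dir, ext=".JPEG"):
    """Return sorted image names in image_dir with the extension stripped."""
    return sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(image_dir)
        if name.endswith(ext)
    )


def write_image_set(image_dir, out_path):
    """Write one image name per line, the format shown above."""
    names = list_image_names(image_dir)
    with open(out_path, "w") as f:
        f.write("\n".join(names) + "\n")
    return names
```

<p>For the layout above one would call <code class="highlighter-rouge">write_image_set</code> with the Images folder and the path of train.txt inside ImageSets.</p>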
<h2 id="construct-imdb-file">Construct IMDB File</h2>
<p>We need to create a file imagenet.py in the directory <code class="highlighter-rouge">$FRCNN_ROOT/lib/datasets</code>. This file defines some functions which tell fast rcnn how to read ground truth boxes and how to find images on disk. I mainly changed these functions:</p>
<h4 id="__init__selfimage_setdevkit_path">__init__(self,image_set,devkit_path):</h4>
<p>This function is easy to modify, only two lines need to be changed:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>self._classes = ('__background__','n02958343','n02769748')
self._image_ext = '.JPEG'
</code></pre>
</div>
<p>These two lines specify the classes and the image extension. To save time, I chose only 2 of the dataset’s 200 classes. Pay attention to the names of these two classes: in the annotation files, the ground truth class is its ID in the imagenet dataset, such as n02958343 or n02769748, not its real name, such as car or backpack.</p>
<h4 id="_load_imagenet_annotation">_load_imagenet_annotation</h4>
<p>This function is important for training: it tells fast rcnn how to read annotation files. The imagenet annotation files are much like the ones in pascal voc, so the function is easy to work out. Special attention should be paid to these four lines, though:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>x1 = float(get_data_from_tag(obj, 'xmin')) - 1
y1 = float(get_data_from_tag(obj, 'ymin')) - 1
x2 = float(get_data_from_tag(obj, 'xmax')) - 1
y2 = float(get_data_from_tag(obj, 'ymax')) - 1
</code></pre>
</div>
<p>In the pascal voc dataset all coordinates start from 1, so 1 is subtracted to make them start from 0. This is not true for imagenet, so we should not subtract 1. The same goes for <code class="highlighter-rouge">box_list.append(raw_data[i][:, (1, 0, 3, 2)]-1) </code>. So we need to change these lines to:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>x1 = float(get_data_from_tag(obj, 'xmin'))
y1 = float(get_data_from_tag(obj, 'ymin'))
x2 = float(get_data_from_tag(obj, 'xmax'))
y2 = float(get_data_from_tag(obj, 'ymax'))
box_list.append(raw_data[i][:, (1, 0, 3, 2)])
</code></pre>
</div>
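<p>For readers who want to see the parsing end to end, here is a self-contained sketch that reads a VOC-style annotation with the standard library instead of the <code class="highlighter-rouge">get_data_from_tag</code> helper. The function <code class="highlighter-rouge">load_boxes</code> and the sample XML are made up for illustration; they are not the code from imagenet.py.</p>

```python
import xml.etree.ElementTree as ET

CLASSES = ("__background__", "n02958343", "n02769748")


def load_boxes(xml_text):
    """Return [(class_index, x1, y1, x2, y2), ...] from a VOC-style annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        box = obj.find("bndbox")
        # No "- 1" here: the coordinates are used exactly as stored.
        coords = [float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((CLASSES.index(name), *coords))
    return boxes


# Hypothetical annotation with a single object.
SAMPLE = """
<annotation>
  <object>
    <name>n02958343</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
</annotation>
"""
print(load_boxes(SAMPLE))  # prints [(1, 10.0, 20.0, 110.0, 220.0)]
```

<p>The class index is looked up by the ImageNet ID, which is why the class tuple has to contain the IDs rather than human-readable names.</p>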
<p>I didn’t do any change to other functions. My imagenet.py file can be found <a href="https://github.com/sunshineatnoon/fast-rcnn/blob/master/lib/datasets/imagenet.py">here</a> for a reference.</p>
<h4 id="factorypy">factory.py</h4>
<p>This file is easy to change, just add these lines to it:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># Register imagenet_train and imagenet_test (selective search "fast" mode)
import datasets.imagenet
imagenet_devkit_path = '/path/to/imagenet'
for split in ['train', 'test']:
    name = '{}_{}'.format('imagenet', split)
    # split=split binds the current value of split to each lambda
    __sets[name] = (lambda split=split: datasets.imagenet(split, imagenet_devkit_path))
</code></pre>
</div>
<p>This tells fast rcnn where we put our dataset.</p>
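<p>The <code class="highlighter-rouge">split=split</code> default argument in the lambda is easy to overlook. The standalone sketch below, which does not depend on fast-rcnn, shows why it is needed: without it, every registered lambda would see the last value of the loop variable.</p>

```python
# Without the default argument every lambda sees the final loop value.
late = {s: (lambda: s) for s in ["train", "test"]}

# Binding split=s freezes the value at definition time.
bound = {s: (lambda split=s: split) for s in ["train", "test"]}

print(late["train"](), bound["train"]())  # prints test train
```

<p>With late binding, asking for the train split would silently build the test split, so the default-argument idiom matters whenever lambdas are created in a loop.</p>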
<h4 id="__init__py">__init__.py</h4>
<p>The only thing we need to change for this file is to import imagenet:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>from .imagenet import imagenet
</code></pre>
</div>
<h2 id="run-selective-search">Run Selective Search</h2>
<p>Since I don’t have MATLAB installed, I use <a href="http://dlib.net/">dlib</a>’s selective search instead. dlib is a fast and convenient library that implements many computer vision algorithms. I use the file <a href="https://github.com/sunshineatnoon/fast-rcnn/blob/master/tools/generate_bbox.py">generate_bbox.py</a> to generate the train.mat file, which holds all object proposals. To use this file, you need to specify the image path and the path of the image name file in it:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>imagenet_path = '/path/to/imagenet/data/Images/'
names = '/path/to/imagenet/data/ImageSets/train.txt'
</code></pre>
</div>
<p>To generate the proposals, just run <code class="highlighter-rouge">python generate_bbox.py</code>. You will then find a train.mat file in the same folder; copy it to imagenet/data/.</p>
<h2 id="modify-prototxt-files">Modify Prototxt Files</h2>
<p>Since we only have 3 classes (including the background class), we need to change the network structure. I trained this model starting from the pre-trained caffenet model, so I need to change <code class="highlighter-rouge">$FRCNN_ROOT/models/CaffeNet/train.prototxt</code> to fit my dataset.</p>
<ul>
<li>For the input layer, we need to change input class to 3: <code class="highlighter-rouge">param_str: "'num_classes': 3"</code></li>
<li>For the cls_score layer, we need to change output class to 3: <code class="highlighter-rouge">num_output: 3</code></li>
<li>For the bbox_pred layer, we need to change output to 3*4=12: <code class="highlighter-rouge">num_output: 12</code></li>
</ul>
<p>See the <a href="https://github.com/sunshineatnoon/fast-rcnn/blob/master/models/CaffeNet/train.prototxt">train.prototxt</a> file for reference.</p>
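<p>The three numbers above follow directly from the number of classes. As a small sanity check, the helper below (a hypothetical function, not part of fast-rcnn) derives them for any number of foreground classes.</p>

```python
def fast_rcnn_output_sizes(num_foreground_classes):
    """Output sizes for the three prototxt entries, background included."""
    num_classes = num_foreground_classes + 1  # add __background__
    return {
        "num_classes": num_classes,      # data layer param_str
        "cls_score": num_classes,        # one score per class
        "bbox_pred": 4 * num_classes,    # 4 box offsets per class
    }


sizes = fast_rcnn_output_sizes(2)
print(sizes)  # prints {'num_classes': 3, 'cls_score': 3, 'bbox_pred': 12}
```

<p>For the 20 pascal voc classes the same rule gives 21 scores and 84 box regression outputs.</p>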
<h2 id="train">Train</h2>
<p>Training is easy, just run this command under <code class="highlighter-rouge">$FRCNN_ROOT</code>:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>./tools/train_net.py --gpu 0 --solver models/CaffeNet/solver.prototxt --weights data/imagenet_models/CaffeNet.v2.caffemodel --imdb imagenet_train
</code></pre>
</div>
<p>For me, each iteration took about 0.44s, so training for 40000 iterations took about 5 hours in total.</p>
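<p>The estimate is simple arithmetic, shown here so you can plug in your own per-iteration time. The variable names are just for illustration.</p>

```python
seconds_per_iteration = 0.44
iterations = 40000

# Total seconds divided by 3600 gives hours.
hours = seconds_per_iteration * iterations / 3600
print(round(hours, 2))  # prints 4.89
```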
<h2 id="see-training-result">See Training Result</h2>
<p>I don’t really want to measure the accuracy of the trained model; I would rather get an intuitive view of its performance.
So I just copy the trained model in <code class="highlighter-rouge">$FRCNN_ROOT/output/default/train/caffenet_fast_rcnn_iter_40000.caffemodel</code> to <code class="highlighter-rouge">$FRCNN_ROOT/data/fast_rcnn_models/</code> (don’t forget to back up the old one). Then run <code class="highlighter-rouge">./tools/demo.py</code> under <code class="highlighter-rouge">$FRCNN_ROOT</code> to see how the trained model works. Of course, demo.py needs to be changed to use dlib’s selective search and the trained model. You can find mine <a href="https://github.com/sunshineatnoon/fast-rcnn/blob/master/tools/demo.py">here</a> for reference.</p>
<h2 id="reference">Reference</h2>
<p>[1] <a href="https://github.com/zeyuanxy/fast-rcnn/tree/master/help/train">https://github.com/zeyuanxy/fast-rcnn/tree/master/help/train</a></p>
<p>[2] <a href="http://www.cnblogs.com/louyihang-loves-baiyan/p/4885659.html?utm_source=tuicool&utm_medium=referral">http://www.cnblogs.com/louyihang-loves-baiyan/p/4885659.html?utm_source=tuicool&utm_medium=referral</a></p>