Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM
Over the past few years, deep neural networks (DNNs) have exhibited great success in predicting the saliency of images. However, there are few works that apply DNNs to predict the saliency of generic videos. In this paper, we propose a novel DNN-based video saliency prediction method. Specifically, we establish a large-scale eye-tracking database of videos (LEDOV), which provides sufficient data to train the DNN models for predicting video saliency. Through the statistical analysis of our LEDOV database, we find that human attention is normally attracted by objects, particularly moving objects or the moving parts of objects. Accordingly, we propose an object-to-motion convolutional neural network (OM-CNN) to learn spatio-temporal features for predicting the intra-frame saliency via exploring the information of both objectness and object motion. We further find from our database that there exists a temporal correlation of human attention with a smooth saliency transition across video frames. Therefore, we develop a two-layer convolutional long short-term memory (2C-LSTM) network in our DNN-based method, using the extracted features of OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can be generated, which consider the transition of attention across video frames. Finally, the experimental results show that our method advances the state-of-the-art in video saliency prediction.
The heat maps show that: (1) regions containing objects draw the majority of human attention, (2) moving objects or the moving parts of objects attract more human attention, and (3) a dynamic pixel-wise transition of human attention occurs across video frames. To address points (1) and (2), a novel OM-CNN is constructed to learn the features of object motion. For point (3), our 2C-LSTM network retains the spatial information of the attention distribution, with structured output, through its convolutional connections.

Attention heat maps of some frames selected from two videos.
For video saliency prediction, we develop a new DNN architecture that combines OM-CNN and 2C-LSTM. Inspired by the findings above, OM-CNN integrates both the regions and the motion of objects to predict video saliency through two subnets: an objectness subnet and a motion subnet. In OM-CNN, the objectness subnet yields a coarse objectness map, which is used to mask the features output by the convolutional layers in the motion subnet. The spatial features from the objectness subnet and the temporal features from the motion subnet are then concatenated to generate the spatio-temporal features of OM-CNN. The architecture of OM-CNN is shown below. In addition, 2C-LSTM with Bayesian dropout is developed to learn the dynamic saliency of video clips, taking the spatio-temporal features of OM-CNN as input. Finally, the saliency map of each frame is generated by the 2 deconvolutional layers of 2C-LSTM. The architecture of 2C-LSTM is also shown below.

Overall architecture of our OM-CNN for predicting intra-frame video saliency.
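
The masking and concatenation steps described above can be illustrated with a minimal NumPy sketch. It is not the OM-CNN implementation: the function name `mask_and_concat`, the tensor shapes, and the use of a plain element-wise product as the mask are assumptions made for illustration, and the exact masking formulation in the paper may differ.

```python
import numpy as np


def mask_and_concat(spatial_feats, motion_feats, objectness_map):
    """Mask motion features with a coarse objectness map, then concatenate.

    Hypothetical shapes (not taken from the paper):
      spatial_feats:  (C_s, H, W) features from the objectness subnet
      motion_feats:   (C_m, H, W) features from the motion subnet
      objectness_map: (H, W) values in [0, 1], high where an object is likely
    Returns spatio-temporal features of shape (C_s + C_m, H, W).
    """
    if spatial_feats.shape[1:] != motion_feats.shape[1:]:
        raise ValueError("spatial and motion features must share H and W")
    if objectness_map.shape != motion_feats.shape[1:]:
        raise ValueError("objectness map must have shape (H, W)")

    # Assumption: the mask is a simple element-wise product, broadcast over
    # channels, so motion responses outside object regions are suppressed.
    masked_motion = motion_feats * objectness_map[None, :, :]

    # Concatenate along the channel axis to form spatio-temporal features.
    return np.concatenate([spatial_feats, masked_motion], axis=0)
```

With this kind of mask, motion responses in background regions (objectness near 0) contribute little to the concatenated features, which matches the observation that attention is drawn to moving objects rather than to motion in general.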

Architecture of our 2C-LSTM for predicting the inter-frame saliency transition.
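
To show how convolutional connections let a recurrent network keep the spatial layout of its input, here is a minimal NumPy sketch of a two-layer convolutional LSTM run over a sequence of per-frame feature maps. It assumes the standard ConvLSTM gate equations without peephole terms, random untrained weights, and hypothetical names (`conv2d_same`, `ConvLSTMCell`, `run_two_layer`) and sizes. It omits the Bayesian dropout and deconvolutional layers of 2C-LSTM, so it should be read as a sketch of the general technique rather than of the authors' network.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv2d_same(x, w):
    """Stride-1, zero-padded 2D cross-correlation.

    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k. Returns (C_out, H, W).
    """
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    windows = sliding_window_view(xp, (k, k), axis=(1, 2))  # (C_in, H, W, k, k)
    return np.einsum("chwij,ocij->ohw", windows, w)


class ConvLSTMCell:
    """One ConvLSTM layer; all four gates come from a single convolution."""

    def __init__(self, in_ch, hid_ch, k=3, seed=0):
        rng = np.random.default_rng(seed)
        # Illustrative random weights; a real model would learn these.
        self.w = rng.normal(0.0, 0.1, size=(4 * hid_ch, in_ch + hid_ch, k, k))
        self.b = np.zeros((4 * hid_ch, 1, 1))

    def step(self, x, h, c):
        z = conv2d_same(np.concatenate([x, h], axis=0), self.w) + self.b
        i, f, o, g = np.split(z, 4, axis=0)
        c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_next = sigmoid(o) * np.tanh(c_next)
        return h_next, c_next


def run_two_layer(features, hid_ch=4):
    """features: (T, C, H, W). Returns second-layer hidden states (T, hid_ch, H, W)."""
    T, C, H, W = features.shape
    cells = [ConvLSTMCell(C, hid_ch, seed=0), ConvLSTMCell(hid_ch, hid_ch, seed=1)]
    states = [(np.zeros((hid_ch, H, W)), np.zeros((hid_ch, H, W))) for _ in cells]
    outputs = []
    for t in range(T):
        x = features[t]
        for idx, cell in enumerate(cells):
            h, c = cell.step(x, *states[idx])
            states[idx] = (h, c)
            x = h  # the hidden state of layer 1 feeds layer 2
        outputs.append(x)
    return np.stack(outputs)
```

Because the hidden and cell states are themselves (channels, H, W) maps, each output location depends only on a local neighborhood of the input and of the previous state, so the spatial structure of the attention distribution is carried from frame to frame.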

Saliency maps yielded by our method and 8 other methods, as well as the ground-truth human fixations.

The average accuracy of saliency detection on the test videos, under cross-validation over our database.

The average accuracy of saliency detection on SFU and DIEM.
Lai Jiang, Mai Xu, Zulin Wang. Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM. arXiv preprint arXiv:1709.06316. 2017.