TY - GEN
T1 - User video summarization based on joint visual and semantic affinity graph
AU - Lei, Zhuo
AU - Sun, Ke
AU - Zhang, Qian
AU - Qiu, Guoping
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/10/16
Y1 - 2016/10/16
N2 - Automatically generating summaries of user-generated videos is very useful but challenging. User-generated videos are unedited and usually consist of a single long shot, which makes traditional video temporal segmentation methods such as shot boundary detection ineffective at producing meaningful video segments for summarization. To address this issue, we propose a novel temporal segmentation framework based on clustering a joint visual and semantic affinity graph of the video frames. Using a pre-trained deep convolutional neural network (CNN), we extract deep visual features of the frames to construct the visual affinity graph. We then construct the semantic affinity graph of the frames from word embeddings of the frames' semantic tags, which are generated by an automatic image tagging algorithm. A dense neighbor method is then used to cluster the joint visual and semantic affinity graph, dividing the video into subshot-level segments from which a summary of the video can be generated. Experimental results show that our approach outperforms state-of-the-art methods. Furthermore, we show that the method achieves results similar to those of manually produced summaries.
AB - Automatically generating summaries of user-generated videos is very useful but challenging. User-generated videos are unedited and usually consist of a single long shot, which makes traditional video temporal segmentation methods such as shot boundary detection ineffective at producing meaningful video segments for summarization. To address this issue, we propose a novel temporal segmentation framework based on clustering a joint visual and semantic affinity graph of the video frames. Using a pre-trained deep convolutional neural network (CNN), we extract deep visual features of the frames to construct the visual affinity graph. We then construct the semantic affinity graph of the frames from word embeddings of the frames' semantic tags, which are generated by an automatic image tagging algorithm. A dense neighbor method is then used to cluster the joint visual and semantic affinity graph, dividing the video into subshot-level segments from which a summary of the video can be generated. Experimental results show that our approach outperforms state-of-the-art methods. Furthermore, we show that the method achieves results similar to those of manually produced summaries.
KW - Clustering
KW - Joint affinity graph
KW - User-generated video
KW - Video summarization
KW - Video temporal segmentation
UR - http://www.scopus.com/inward/record.url?scp=84995488319&partnerID=8YFLogxK
U2 - 10.1145/2983563.2983568
DO - 10.1145/2983563.2983568
M3 - Conference contribution
AN - SCOPUS:84995488319
T3 - iV&L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016
SP - 45
EP - 52
BT - iV&L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016
PB - Association for Computing Machinery, Inc
T2 - 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, iV&L-MM 2016
Y2 - 16 October 2016
ER -