Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization

Abstract

The emergence of wearable devices such as portable cameras and smart glasses makes it possible to record life-logging first-person videos. Browsing such long unstructured videos is time-consuming and tedious. This paper studies the discovery of the moments of a user's major or special interest (i.e., highlights) in a video, in order to generate summaries of first-person videos. Specifically, we propose a novel pairwise deep ranking model that employs deep learning techniques to learn the relationship between highlight and non-highlight video segments. A two-stream network structure, which represents video segments by complementary information on the appearance of video frames and the temporal dynamics across frames, is developed for video highlight detection. Given a long personal video, the highlight detection model assigns a highlight score to each segment. The obtained highlight segments are applied for summarization in two ways: video timelapse and video skimming. The former plays the highlight (non-highlight) segments at low (high) speed rates, while the latter assembles the sequence of segments with the highest scores. On 100 hours of first-person videos covering 15 unique sports categories, our highlight detection achieves a 10.5% accuracy improvement over the state-of-the-art RankSVM method. Moreover, our approaches produce video summaries of better quality, as verified by a user study with 35 human subjects.

1. Introduction

Wearable devices have become pervasive. People are taking first-person videos with these devices every day and everywhere. For example, wearable camcorders such as GoPro cameras and Google Glass are now able to capture high-quality first-person videos that record our daily experiences. These first-person videos are usually extremely unstructured and long-running. Browsing and editing such videos is a truly tedious job. Therefore, video summarization, which produces a short summary of a full-length video and ideally encapsulates its most informative parts, is becoming increasingly important for alleviating the problems of first-person video browsing, editing and indexing.

The research on video summarization has mainly proceeded along two dimensions, i.e., keyframe or shot-based [15, 18] and structure-driven [17, 22] approaches. Keyframe or shot-based methods select a collection of keyframes or shots by optimizing the diversity or representativeness of a summary, while structure-driven approaches exploit a set of well-defined structures in certain domains (e.g., audience cheering, goal or score events in sports videos) for summarization. In general, existing approaches offer sophisticated ways to sample a condensed synopsis from the original video, reducing the time required for users to view all the content.

However, defining video summarization as a sampling problem, as in conventional approaches, is very limited because users' interests in a video are fully overlooked. As a result, special moments may be omitted due to the visual diversity criterion of excluding redundant parts from a summary. The limitation is particularly severe when those methods are directly applied to first-person videos. First-person videos captured with wearable devices record experiences from a first-person perspective in unconstrained environments, making them long, redundant and unstructured. Moreover, the continuous nature of such videos yields no evident shot boundaries for a summary; nevertheless, there should be moments (segments) of major or special interest (i.e., highlights) in raw videos. Therefore, our goal is to provide a new paradigm for first-person video summarization by exploring the highlights in a video.

As the first step towards this goal, Figure 1 demonstrates the process of retrieving highlights in raw first-person videos. A raw video is divided into several segments. The highlight measure for each segment is equivalent to learning a function that predicts the highlight score given the representations of the segment. The higher the score, the more highlighted the segment. Thus, the segments with high scores can be selected as video highlights. Furthermore, in order to incorporate both spatial and temporal information for better depicting a video segment, complementary streams of visual appearance from static frames and temporal dynamics across multiple frames are jointly exploited. As such, we devise a two-stream deep convolutional neural network (DCNN) architecture by fusing the DCNN on each stream for video highlight detection. In particular, considering that the highlight score expresses only a relative degree of interest within each video, we propose to train the DCNN on each stream independently with a pairwise deep ranking model, which characterizes the relative relationships by a set of pairs. Each pair contains a highlight and a non-highlight segment from the same video. The DCNN on each stream aims to optimize a function that makes the detection score of the highlight segment higher than that of the non-highlight segment.

Then, by assigning a highlight score to each segment, the highlight-driven summary of a video can be generated in two ways: video timelapse and video skimming. The former keeps all the segments in the video while adjusting their playing speed rates based on highlight scores (highlight segments with a lower playing rate, and vice versa). The latter assembles the sequence of only the highlight segments while trimming out the other, non-highlight ones. We evaluate both video highlight detection and highlight-driven video summarization on a newly created dataset including about 100 hours of first-person videos captured by GoPro cameras for 15 sports categories, which is so far the largest-scale first-person video dataset.

The remaining sections are organized as follows. Section 2 describes the related work. Section 3 presents the architecture of video highlight detection, while Section 4 formulates the problem of video summarization over the predicted video highlights. In Section 5, we provide empirical evaluations on both video highlight detection and video summarization, followed by discussions and conclusions in Section 6.

2. Related Work

The research area of first-person video summarization has recently been gaining an increasing amount of attention [6, 10, 13, 14, 21]. The objective of video summarization is to explore the most important parts of long first-person video sequences. In [13], a short subsequence of the video was selected by using the importance of the objects as the decision criterion. Similar in spirit, video subshots which depict the essential events were concatenated to generate the summary [14]. Later, in [6] and [21], Gygli et al. and Potapov et al. formulated the problem as scoring each video segment in terms of visual importance and interestingness, respectively. The summary was then produced by selecting the segments with the highest scores. Recently, Joshi et al. created a stabilized time-lapse video by rendering, stitching and blending appropriately selected source frames for each output frame [10]. In contrast, our approach explores the moments of user interest in the videos, which we show is vital to distill the essence of the original videos.

In addition to first-person video summarization, there is a large literature on summarization for general videos. Keyframe or shot-based methods use a subset of representative keyframes or segments from the original video to generate a summary. In [15], keyframes and video shots were sampled based on their attention scores, which were measured by combining both visual and aural attention. Similar in spirit, Ngo et al. represented a video as a complete undirected graph, which was partitioned into video clusters to form a temporal graph and further detect video scenes [18]. A video summary can then be generated from the temporal graph in terms of both the structure and attention information. Later, in [16], subshots were first detected and classified into five categories according to the dominant camera motion. A number of representative keyframes, as well as structure and motion information, were then extracted from each subshot to generate the video summary.

Different from keyframe or shot-based methods, structure-driven approaches exploit video structure for summarization. Well-defined structures often exist in broadcast sports videos. A long sports game can be divided into parts, and only a few of these parts contain certain informative segments. For instance, these segments include the scoring moments in soccer games or the hitting moments in baseball games. Based on the well-defined structure, specifically designed audio-visual features, such as crowds, cheering, goals, etc., are used in the structure-driven methods [17, 22].

Most of the above methods focus on selecting frames, shots or segments independently, ignoring the relationship between them. Our work is different in that we learn the relationship of video segments in a pairwise manner, which characterizes the relative preferences of all the segments within a video and benefits video summarization.

Figure 2. Highlight-driven video summarization framework (better viewed in color). (a) The input video is split into a set of segments. (b) Each video segment is decomposed into spatial and temporal streams. The spatial stream takes the form of multiple frame appearances, while the temporal stream is represented by temporal dynamics in a video clip. A deep convolutional neural network architecture for highlight prediction is devised for the spatial and temporal streams, respectively. The outputs of the two components are combined by late fusion into the final highlight score for each video segment. (c) By assigning a highlight score to each video segment, a highlight curve can be obtained for each video. The segments with the highest highlight scores are regarded as “highlights” in the video. (d) Two highlight-driven video summarization methods, i.e., video timelapse and video skimming, can be easily formulated.

3. Video Highlight Detection

In this section, we first present our highlight detection for each video segment by combining two deep convolutional neural network architectures on the spatial and temporal streams, followed by the pairwise deep ranking model used to train each DCNN structure.

3.1. Two-Stream DCNN for Highlight Detection

Video can be naturally decomposed into spatial and temporal components, which are related to the ventral and dorsal streams of human perception, respectively [4, 23]. The ventral stream plays the major role in the identification of objects, while the dorsal stream mediates the sensorimotor transformations for visually guided actions towards such objects. Therefore, we devise a novel two-stream DCNN architecture (TS-DCNN) by late fusing spatial and temporal DCNNs for video highlight detection, as shown in Figure 2 (a)-(c). The spatial component depicts scenes and objects in the video by frame appearance, while the temporal part conveys the temporal dynamics in a video clip (multiple frames).

Given an input video, a set of video segments can be delimited by uniform partition in time, shot boundary detection, or change point detection. For each video segment, the spatial DCNN operates on multiple frames extracted from the segment. The static frame appearance is useful since some highlights are strongly associated with particular scenes and objects. The first stage of the architecture generates a fixed-length visual representation for each video segment. For this purpose, AlexNet [12], a recent advanced image classification architecture, is exploited for extracting the softmax scores of multiple frames. Then, average pooling [1] is performed over all the frames to get a single 1,000-dimensional vector for each video segment. The AlexNet is pre-trained on 1.2 million images of the ImageNet challenge dataset [2]. The resulting 1,000-dimensional representation of a video segment forms the input to a subsequent neural network that predicts the highlight score of this segment. The architecture of this neural network is F1000−F512−F256−F128−F64−F1, i.e., six fully-connected layers (denoted by F followed by the number of neurons). The output of the last layer is taken as the highlight score.
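
For concreteness, the following is a minimal sketch (in PyTorch, not the authors' code) of the spatial-stream scoring head described above. The frames of a segment are assumed to already be encoded as 1,000-dimensional AlexNet softmax vectors; the activation functions and dropout placement are assumptions, as the text only specifies the layer widths.

```python
import torch
import torch.nn as nn

class SpatialHighlightHead(nn.Module):
    """Scores one video segment from average-pooled AlexNet softmax features.
    A sketch of the F1000-F512-F256-F128-F64-F1 head described in the text;
    the ReLU activations and dropout placement are assumptions."""
    def __init__(self):
        super().__init__()
        widths = [1000, 512, 256, 128, 64, 1]    # six fully-connected layers
        layers, d_in = [], 1000                  # input: pooled 1,000-d AlexNet vector
        for i, w in enumerate(widths):
            layers.append(nn.Linear(d_in, w))
            if i < len(widths) - 1:              # no activation/dropout after the score layer
                layers += [nn.ReLU(), nn.Dropout(0.5)]
            d_in = w
        self.mlp = nn.Sequential(*layers)

    def forward(self, frame_softmax):            # frame_softmax: (num_frames, 1000)
        segment_repr = frame_softmax.mean(dim=0) # average pooling over the frames
        return self.mlp(segment_repr)            # scalar highlight score f(s)
```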

Unlike the spatial DCNN, the input to the temporal DCNN architecture is comprised of multiple video clips, each containing multiple frames. Such inputs explicitly describe the temporal dynamics between frames. To generate the representation of each video clip, a 3D CNN is utilized. Different from a traditional 2D CNN, the 3D CNN architecture takes video clips as inputs and consists of alternating 3D convolutional and 3D pooling layers, which are further topped by a few fully-connected layers as described in [8]. Specifically, C3D [25], which is pre-trained on the Sports-1M video dataset [11], is exploited, and we regard the outputs of its fc6 fully-connected layer as the representation of each video clip. Similar to the spatial DCNN architecture, the temporal DCNN fuses the outputs of C3D on each video clip, which are then fed into a neural network for video highlight detection.

Figure 3. The training of the spatial DCNN architecture with the pairwise deep ranking model. The inputs are a set of highlight and non-highlight video segment pairs, which are fed independently into two identical spatial DCNNs with shared architecture and parameters. A ranking layer on top evaluates the margin ranking loss of the pair. Note that the training of the temporal DCNN follows the same philosophy.

By late fusing the two predicted highlight scores of the spatial and temporal DCNNs, we obtain the final highlight score for each video segment and form a highlight curve for the whole video. The video segments with high scores are selected as video highlights accordingly. It is worth noting that although the two streams used here are visual appearance and temporal dynamics, our approach is applicable to any additional stream, e.g., an audio stream.
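
As a hedged sketch of this step (the text does not detail the fusion beyond a weighted sum), the fused highlight curve and the top-scoring segments could be computed as follows; the default weight of 0.3 for the spatial stream mirrors the best setting reported later in the experiments and is only illustrative.

```python
def highlight_curve(spatial_scores, temporal_scores, omega=0.3):
    """Late-fuse per-segment scores from the two streams into a highlight curve.
    omega weights the spatial stream (S-DCNN); 1 - omega weights the temporal
    stream (T-DCNN). The default value 0.3 is illustrative."""
    return [omega * s + (1.0 - omega) * t
            for s, t in zip(spatial_scores, temporal_scores)]

def top_highlight_segments(curve, k):
    """Return the indices of the k segments with the highest fused scores."""
    return sorted(range(len(curve)), key=lambda i: curve[i], reverse=True)[:k]
```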

3.2. Pairwise Deep Ranking Model

As with most deep learning problems, the learning of our spatial and temporal DCNN architectures is critical for video highlight detection. Existing deep learning models for visual recognition often focus on learning category-level representations [9, 12]. The learnt representations mainly correspond to visual semantics. Instead, in our case, the highlight score of every video segment reflects its degree of interest within a video and represents a relative measure. It is straightforward to formulate this as a supervised ranking problem. More importantly, the intrinsic properties of visual recognition and ranking tasks are different, as visual recognition is modeled as a binary classification problem while the ranking task is considered a regression problem. As such, a good network for visual recognition may not be optimal for distinguishing video highlights.

Deriving from the idea of exploring relative relationships through ranking [27, 28], we develop a pairwise deep ranking model to learn our spatial and temporal DCNN architectures for predicting video highlights. Figure 3 shows the training of the spatial DCNN with the pairwise deep ranking model. Given a pair of highlight and non-highlight video segments, we wish to optimize our spatial DCNN architecture so that it outputs a higher score for the highlight segment than for the non-highlight one. Formally, suppose we have a set of pairs P, where each pair (hi, ni) consists of a highlight video segment hi and a non-highlight segment ni from the same video. The two segments are fed separately into two identical spatial DCNNs with shared architecture and parameters. A pair characterizes the relative highlight degree of the two video segments. The output f(·) of the spatial DCNN is the highlight score of the input video segment. Our goal is to learn the DCNN architecture that assigns a higher output score to the highlight segment, which can be expressed as

$$f(h_i) > f(n_i), \quad \forall (h_i, n_i) \in \mathcal{P} \tag{1}$$

As the output scores exhibit a relative ranking order for the video segments, a ranking layer on the top is employed to evaluate the margin ranking loss of each pair, which is a convex approximation to the 0-1 ranking error loss and has been used in several information retrieval methods [20, 29]. Specifically, it can be given by

$$\mathcal{L}(h_i, n_i) = \max\big(0,\, 1 - f(h_i) + f(n_i)\big) \tag{2}$$

The ranking layer does not have any parameters. During learning, it evaluates the model's violation of the ranking order and back-propagates the gradients to the lower layers so that they can adjust their parameters to minimize the ranking loss. To avoid overfitting, dropout [7] with a probability of 0.5 is applied to all the fully-connected layers after AlexNet in our architecture.
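
The sketch below illustrates one training step of this pairwise scheme in PyTorch; it is not the authors' code. The same shared network (e.g., the SpatialHighlightHead sketched earlier) scores both segments of a pair, and the margin ranking loss of Eq. (2) is minimized; the optimizer and learning rate are assumptions.

```python
import torch

def pairwise_ranking_step(model, optimizer, highlight_feats, non_highlight_feats):
    """One training step of the pairwise deep ranking model.
    Both segments are scored by the same shared network, and the margin
    ranking loss max(0, 1 - f(h) + f(n)) from Eq. (2) is back-propagated."""
    model.train()
    optimizer.zero_grad()
    f_h = model(highlight_feats)        # score of the highlight segment
    f_n = model(non_highlight_feats)    # score of the non-highlight segment
    loss = torch.clamp(1.0 - f_h + f_n, min=0.0).mean()
    loss.backward()                     # gradients flow through the shared parameters
    optimizer.step()
    return loss.item()

# Illustrative setup (hyper-parameters are assumptions):
# model = SpatialHighlightHead()
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```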

The process of temporal DCNN training is the same as spatial DCNN. After training, the learnt spatial and temporal DCNN architectures are late fused for video highlight detection as shown in Figure 2 (b).

4. Highlight-driven Video Summarization

After we obtain the highlight score for each video segment, how can we use it for video summarization? Two video summarization approaches, i.e., video timelapse and video skimming (Figure 2 (d)), can be easily formulated.

4.1. Video Timelapse

A simple and robust technique for video summarization is timelapse, i.e., increasing the speed of the non-highlight video segments by selecting every r-th frame and showing the highlight segments in slow motion. In particular, since all the segments are eventually included, there is no strict demand on video segmentation, so in this case we simply divide the video into segments evenly rather than analyzing the video content. Let Lv, Lh and Ln be the lengths of the original video, the highlight segments and the non-highlight segments, respectively. Typically we have Lh ≪ Ln, Lv. Without loss of generality, we consider the case where the rate of decelerating highlight segments and speeding up non-highlight parts is the same and denote this rate by r. Given a maximum summary length L, the problem is then equivalent to finding a proper rate r which satisfies

$$r \cdot L_h + \frac{L_n}{r} \leq L \tag{3}$$
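
A minimal helper for this step is sketched below. It solves the quadratic implied by Eq. (3), L_h·r² − L·r + L_n ≤ 0, and returns the smaller feasible root, i.e., the mildest speed change that still fits the budget; choosing that particular root is an assumption, since the text only asks for some rate satisfying the formula.

```python
import math

def timelapse_rate(L_h, L_n, L):
    """Find a playback rate r with r * L_h + L_n / r <= L (Eq. (3)).
    Feasible rates lie between the roots of L_h*r^2 - L*r + L_n = 0; the
    smaller root is returned here, which is a design choice, not the paper's."""
    if L_h + L_n <= L:
        return 1.0                           # the video already fits the budget
    disc = L * L - 4.0 * L_h * L_n           # discriminant of the quadratic in r
    if disc < 0:
        raise ValueError("budget too small: need L >= 2*sqrt(L_h*L_n)")
    return (L - math.sqrt(disc)) / (2.0 * L_h)

# Example: 1 minute of highlights, 9 minutes of non-highlights, 7-minute budget:
# timelapse_rate(60, 540, 420) ~= 1.70, i.e. highlights are played at ~0.59x
# speed and non-highlights at ~1.70x speed.
```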

In this way, we generate a video summary by compressing the non-highlight video segments while expanding the highlight parts. In general, video timelapse has two major characteristics. First, all the video content is contained in the summary. As a result, there is no risk of omitting any important segments, making the summary more coherent and continuous in telling the camera wearer's story. Furthermore, the video segments of interest are underlined and presented in detail.

4.2. Video Skimming

Video skimming addresses the summarization problem by providing a short summary of the original video which ideally includes all the important video segments. A common practice in video skimming is to first perform a temporal segmentation, and then single out a few segments to form an optimal summary in terms of certain criteria, e.g., interestingness and importance. Following [21], we exploit a kernel temporal segmentation approach which originates from the idea of multiple change point detection. Readers can refer to [21] for more technical details.

After the segmentation, highlight detection is applied to each video segment, producing a highlight score. Given the set of video segments S = {s1, . . . , sc}, where each segment is associated with a highlight score f(si), we aim to single out a subset whose total length is below the maximum length L and whose sum of highlight scores is maximized. Specifically, the problem is defined as

$$\max_{b} \; \sum_{i=1}^{c} b_i\, f(s_i) \quad \text{s.t.} \quad \sum_{i=1}^{c} b_i\, |s_i| \leq L \tag{4}$$

where bi ∈ {0, 1} and bi = 1 indicates that the ith segment is selected, and |si| is the length of the ith segment. The maximization is a standard 0/1-knapsack problem and can be solved with dynamic programming for a globally optimal solution [5].
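
A minimal sketch of this dynamic program is given below, assuming integer segment lengths (e.g., in seconds); it returns the indices of the selected segments, which can then be concatenated in temporal order to form the skim.

```python
def select_highlight_segments(scores, lengths, L):
    """0/1-knapsack DP for Eq. (4): pick segments maximizing the total highlight
    score subject to the total selected length not exceeding L.
    scores[i] = f(s_i); lengths[i] = |s_i|, assumed to be a positive integer."""
    c = len(scores)
    dp = [0.0] * (L + 1)                          # best score within length budget j
    taken = [[False] * (L + 1) for _ in range(c)] # whether segment i improved dp[j]
    for i in range(c):
        for j in range(L, lengths[i] - 1, -1):    # backwards: each segment used once
            candidate = dp[j - lengths[i]] + scores[i]
            if candidate > dp[j]:
                dp[j] = candidate
                taken[i][j] = True
    # Backtrack to recover the selected segment indices.
    selected, j = [], L
    for i in range(c - 1, -1, -1):
        if taken[i][j]:
            selected.append(i)
            j -= lengths[i]
    return sorted(selected)                       # keep temporal order for assembling
```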

5. Experiments

We conducted our experiments on a newly created first-person video dataset crawled from YouTube and evaluated our approaches on both video highlight detection and highlight-driven video summarization.

5.1. Dataset

While research on first-person video analysis has recently received intensive attention, the public datasets to date are still small (e.g., up to 20 hours) and very specific (e.g., in a kitchen). To substantially evaluate our approach, we collect a new large dataset from YouTube for first-person video highlight detection. The dataset consists of 100 hours of videos, mainly captured by GoPro cameras, covering 15 sports-related categories. In particular, we query the YouTube database with “category name + GoPro” to retrieve relevant videos. Given the retrieved videos, those with visible editing traces, such as scene splicing or rendering, are removed manually. Hence, our dataset is constructed with only raw videos. Figure 4 shows a representative frame selected from each category in our dataset. For each category, there are about 40 videos, each with a duration between 2 and 15 minutes.

To evaluate our highlight results, we first split each raw video into a set of five-second segments evenly sampled across the video and ask multiple evaluators to label the highlight level of each segment. We invited 12 evaluators from different educational backgrounds, including linguistics, physics, business, computer science, and design. All evaluators are outdoor sports enthusiasts and some of them are from a local outdoor sports club. Each video segment was annotated on a three-point ordinal scale: 3–highlight; 2–normal; 1–boring. To make the annotation as objective as possible, three labelers were assigned to each video. Only video segments whose aggregate scores were at or over 8 points were selected as “highlight.” Note that obtaining these annotations was very time-consuming. The labelers were requested to watch the whole video before assigning labels to each segment, since highlight is a relative judgement within a video. The dataset is partitioned into training and test sets evenly on all 15 categories for our experiments.

5.2. Highlight Detection

The first experiment was conducted on our first-person video dataset to examine how our spatial and temporal DCNNs work on highlight detection.

Compared Approaches. We compare the following approaches for performance evaluation:

(1) Rule-based model [16] (Rule). The video is first segmented into a series of shots based on color information. Each shot is then decomposed into one or more subshots by a motion threshold-based approach. The highlight score for each subshot is proportional to its length, given that a longer subshot usually contains more informative content.
(2) Importance-based model [21]. A linear SVM classifier per category is trained to score the importance (highlight) of each video segment. For each category, we use all the video segments of this category as positive examples and the video segments from the other categories as negatives. We adopt both the improved dense trajectories motion features proposed in [26] and the average of DCNN frame features for representing each video segment. The detailed settings are presented in the parameter settings. The two runs based on improved dense trajectories and DCNN are named Imp+IDT and Imp+DCNN, respectively.
(3) Latent ranking model [24]. A latent linear ranking SVM model per category is trained to score the highlight of each video segment. For each category, all the highlight and non-highlight video segment pairs within each video are exploited for training. Similarly, improved dense trajectories and the average of DCNN frame features are extracted as the representations of each segment. We refer to the two runs as LR+IDT and LR+DCNN, respectively.
(4) Deep Convolutional Neural Networks model. We designed three runs for our proposed approaches: S-DCNN, T-DCNN and TS-DCNN. The runs S-DCNN and T-DCNN predict the highlight score of a video segment by separately using the spatial DCNN and the temporal DCNN, respectively. The result of TS-DCNN is the weighted summation of S-DCNN and T-DCNN by late fusion.

Parameter Settings. In the experiments, we uniformly pick three frames every second and hence have 15 frames for each five-second video segment. Following [25], each video clip is composed of the first 16 continuous frames of every second, so each segment has 5 video clips. For S-DCNN and T-DCNN training, only the segment pairs whose aggregate scores differ by more than 3 points are selected, and in total we have 105K pairs in the training set.
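
The pair-mining rule above could be implemented as in the following sketch: within one video, any two segments whose aggregate annotation scores differ by more than 3 points form a (highlight, non-highlight) training pair. The function name and return format are illustrative, not the authors' code.

```python
from itertools import combinations

def mine_training_pairs(aggregate_scores, min_gap=3):
    """Build (highlight, non-highlight) index pairs within one video.
    aggregate_scores[i] is the summed annotation score of segment i (three
    labels on a 1-3 scale, so values range from 3 to 9); a pair is kept only
    when the two scores differ by more than min_gap points."""
    pairs = []
    for i, j in combinations(range(len(aggregate_scores)), 2):
        gap = aggregate_scores[i] - aggregate_scores[j]
        if gap > min_gap:
            pairs.append((i, j))        # i is the highlight segment, j the non-highlight
        elif -gap > min_gap:
            pairs.append((j, i))
    return pairs

# e.g. mine_training_pairs([9, 5, 4, 8]) -> [(0, 1), (0, 2), (3, 2)]
```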

To make the performance of these methods comparable, the representation of each video segment in Imp+DCNN and LR+DCNN is the average of the AlexNet outputs on the selected frames, which is the same as in our S-DCNN. For the extraction of the trajectory descriptor, we use the default parameters, which result in 426 dimensions. The local descriptor is then reduced to half of its dimensions with PCA, applied separately on each component of the descriptor. Finally, each video segment is encoded as a Fisher Vector [19] based on a GMM of 256 Gaussians. Furthermore, following the setting in [24], we use the liblinear package [3] to train both LR+IDT and LR+DCNN with the same stopping criteria: a maximum of 10K iterations and ε = 0.0001.

Evaluation Metrics. We calculate the average precision of highlight detection for each video in the test set, and the mean Average Precision (mAP) averaged over all test videos is reported. In addition, since it is natural to treat highlight detection as a problem of ranking segments within one video, we further adopt Normalized Discounted Cumulative Gain (NDCG), which takes multi-level highlight scores into account, as a performance metric. Given a ranked list of segments for a video, the NDCG score at depth d in the ranked list is defined by $NDCG@d = Z_d \sum_{j=1}^{d} \frac{2^{r_j}-1}{\log(1+j)}$, where $r_j \in \{5: a_s \geq 8;\ 4: a_s = 7;\ 3: a_s = 6;\ 2: a_s = 5;\ 1: a_s \leq 4\}$ represents the rating of a segment in the ground truth and $a_s$ denotes the aggregate score of each segment. $Z_d$ is a normalization constant chosen so that NDCG@d = 1 for a perfect ranking. The final metric is the average of NDCG@d over all videos in the test set.
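
For reference, a minimal sketch of this metric is given below; the base of the logarithm is not stated in the text, so the natural logarithm is assumed.

```python
import math

def rating_from_aggregate(a):
    """Map an aggregate annotation score a_s to the 1-5 rating r_j."""
    if a >= 8: return 5
    if a == 7: return 4
    if a == 6: return 3
    if a == 5: return 2
    return 1                               # a <= 4

def ndcg_at_d(predicted_scores, aggregate_scores, d):
    """NDCG@d for one video: rank segments by predicted highlight score and
    accumulate (2^r - 1)/log(1 + j) over the top d positions, normalized by
    the same quantity computed on the ideal (ground-truth) ranking."""
    ratings = [rating_from_aggregate(a) for a in aggregate_scores]
    order = sorted(range(len(ratings)),
                   key=lambda i: predicted_scores[i], reverse=True)

    def dcg(ranked):
        return sum((2 ** r - 1) / math.log(1 + j)
                   for j, r in enumerate(ranked[:d], start=1))

    ideal = dcg(sorted(ratings, reverse=True))
    return dcg([ratings[i] for i in order]) / ideal if ideal > 0 else 0.0
```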

Performance Comparison. Figure 5 shows the performance of the eight runs averaged over all the test videos in our dataset. Overall, the results across different evaluation metrics consistently indicate that our TS-DCNN leads to a performance boost over the others. In particular, the mAP of TS-DCNN reaches 0.3574, an improvement over LR+IDT of 10.5%. More importantly, the run time of TS-DCNN is several dozen times less than that of LR+IDT; more details are given in the run time section below. Since the Rule run is only based on general subshot detection without any prior highlight knowledge, it is not surprising that all the other methods exhibit significantly better performance than the Rule run.

There is a significant performance gap between S-DCNN (T-DCNN) and LR+DCNN (LR+IDT). Though both runs train the model in a pairwise manner, they are fundamentally different in that S-DCNN (T-DCNN) uses a DCNN architecture, whereas LR+DCNN (LR+IDT) is based on ranking SVM techniques. The results basically indicate the advantage of exploiting a DCNN architecture for the highlight detection task. Furthermore, LR+DCNN (LR+IDT) exhibits better performance than Imp+DCNN (Imp+IDT), which formulates highlight detection as a binary classification problem using a linear SVM model. In addition, as observed in our results, using motion (temporal) features consistently offers better performance than multiple static frame appearances across all three models. This somewhat reveals that video highlights often appear in segments with some special motions and hence are better represented by temporal features.

Figure 6 details the mAP performance across different categories. Different from the importance-based model and the latent ranking model, which are category-specific, our model is general for all categories. Nevertheless, among the 15 categories, our S-DCNN, T-DCNN and TS-DCNN achieve the best performance for 11 categories, which empirically verifies the merit of our model from the aspect of category independence. However, considering that all 15 categories are sports-related, it is still not clear whether the proposed techniques generalize to video content from all domains. Moreover, the complementarity between S-DCNN and T-DCNN is generally as expected. For instance, the videos of the category “fencing” are diverse in frame appearance, resulting in poor performance by S-DCNN. Instead, temporal dynamics are found to be more helpful for this category. In the case of the category “golf,” where motion is relatively scarce, the visual features of frames show better performance.

Figure 8 further shows ten segments uniformly sampled from a video for “surfing,” “skydiving,” and “climbing.” Each segment is represented by one sampled frame. As illustrated in the figure, the ten segments are ranked according to their predicted highlight scores by our TS-DCNN and we can easily see that the ranking order reflects the relative degree of interest within a video.

Fusion Parameter. A common issue with late fusion is the need to set the parameter trading off S-DCNN and T-DCNN, i.e., ω in ω × S-DCNN + (1 − ω) × T-DCNN. In the previous experiments, ω was set optimally in terms of mAP performance. Furthermore, we conducted experiments to test the performance when the value of ω is set from 0.1 to 0.9. Figure 7 shows the mAP, NDCG@1 and NDCG@5 performances with respect to different values of ω. We can see that the performance curves are relatively smooth and achieve the best result around ω = 0.3. In general, this again confirms that T-DCNN leads to a better performance gain and is thus given more weight in the fusion.

Run Time. Table 1 lists the detailed run time of each approach on predicting a five-minute video. Note that the run times of LR+IDT and Imp+IDT, of LR+DCNN and Imp+DCNN, and of T-DCNN and TS-DCNN are respectively the same, so only one of each pair is presented in the table. We see that our method offers the best tradeoff between performance and efficiency. Our TS-DCNN finishes in 360 seconds, which is slightly longer than the video duration.

5.3. Video Summarization

The second experiment was performed to evaluate our highlight-driven video summarization.
Compared Approaches. We compare the following approaches for performance evaluation:

(1) Uniform sampling (UNI). A simple approach that uniformly selects K subshots throughout the video.
(2) Importance-driven summary [21] (IMP). Kernel temporal segmentation is first applied for video segmentation, then an importance score is assigned to each segment by Imp+DCNN. Finally, segments are included in the summary in order of their importance.
(3) Interestingness-driven summary [6] (INT). The method starts by finding positions appropriate for a cut and obtains a set of segments. The interestingness score of each segment is the sum over the interestingness of its frames, which is jointly estimated by low-level (e.g., quality and saliency) and high-level (e.g., motion and person detection) features. Based on the interestingness scores, an optimal subset of segments is selected and concatenated into a summary.
(4) Highlight-driven summary. We designed two runs for our proposed highlight-driven video summarization approaches described in Section 4, i.e., HD-VT and HD-VS, which exploit the video timelapse and video skimming techniques, respectively.

Performance Comparison. We conduct a subjective evaluation to compare the generated summaries. The evaluation process is as follows. First, all the evaluators are required to watch the original video. Then we show them two summaries of that video at a time. One is produced by HD-VT or HD-VS and the other comes from the remaining four runs. Note that we do not reveal which is ours and order the two summaries randomly. After viewing both, the evaluators are asked two questions: 1) Coverage: Which summary better covers the progress of the video? 2) Presentation: Which summary better distills and presents the essence of the video?

We randomly selected three test videos from each of the 15 categories, so the evaluation set consists of 45 original videos, each associated with five summaries. As only the comparisons between our methods and the other three baselines are taken into account, we have 45 × 7 pairs of summaries to be tested in total. We invited 35 evaluators from different educational backgrounds, ranging from 20 to 52 years old. For each pair of compared approaches, the percentage of the 35 evaluators' choices is averaged over all 45 videos and reported.

Tables 2 and 3 show the statistics for our proposed HD-VT and HD-VS, respectively. Overall, a strong majority of the evaluators prefer the summaries produced by HD-VT and HD-VS over the other three methods in terms of both the Coverage and Presentation criteria. The results support our point that video highlights can distill the moments of user interest and thus better summarize first-person videos. Compared to HD-VS, HD-VT achieves more preferences in Coverage as it keeps all the video content. HD-VS, in contrast, benefits from skimming over the non-highlight segments and hence gets more votes in Presentation. Furthermore, UNI, which concatenates uniformly selected subshots, is in general not informative for long and unstructured first-person videos. Though both IMP and INT involve scoring each video segment for summarization, they formulate the scoring as a classification problem and tend to include more near-duplicate segments. HD-VT and HD-VS, in contrast, treat it as a pairwise ranking problem and have a better ability to differentiate each segment, thus allowing better summarization in terms of both Coverage and Presentation.

6. Discussion and Conclusion

We have presented a new paradigm of exploring the moments of user interest, i.e., highlights, for first-person video summarization. In particular, we propose a category-independent deep video highlight detection model, which incorporates both spatial and temporal streams based on deep convolutional neural networks. On a large first-person video dataset, performance improvements are observed when compared to other highlight detection techniques such as a linear SVM classifier and a latent linear ranking SVM model, both of which are category-specific. Furthermore, together with the two summarization methods, we have developed our highlight-driven video summarization system. Our user study with 35 human subjects shows that a majority of users prefer our summaries over both the importance-based and interestingness-based methods.
