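VALSE 2019 Summary (2): Human-Centered Visual Understanding

2. Human-Centered Visual Understanding (Cewu Lu, SJTU)

2.1 Video-Based Temporal Modeling and Action Recognition Methods (Limin Wang, NJU)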

dataset
- Two figures:
- Note one distinction: trimmed vs. untrimmed videos
outline
- action recognition
- action temporal localization
- action spatial detection
- action spatial-temporal detection
opportunities and challenges
- opportunities:
  - videos provide huge and rich data for visual learning
  - action is core in motion perception and has many applications in video understanding
- challenges:
  - complex dynamics and temporal variations
  - action vocabulary is not well defined
  - noisy and weak labels (dense labeling is expensive)
  - high computational and memory cost
temporal structure: actions need to be decomposed (decomposition)
commonly used deep networks
- large-scale video classification with CNN (Fei-Fei Li, CVPR 2014)
- Two-Stream CNN for action recognition in videos (NIPS 2014)
- learning spatiotemporal features with 3D CNN (Du Tran, ICCV 2015)
- TDD (Limin Wang, CVPR 2015)
- Real-time action recognition with enhanced motion vector CNNs (CVPR 2016)
- Two-Stream I3D (CVPR 2017)
- R(2+1)D (CVPR 2018)
- SlowFast Networks (Kaiming He, ICCV 2019)
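Three of Limin Wang's own works
- short-term -> middle-term -> long-term modeling; the corresponding papers are ARTNet -> TSN -> UntrimmedNet
- for more detail, read his slides directly and write up notes
- according to Limin Wang himself, video action recognition/detection is of little help for my VAD work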
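TSN, the middle-term model named above, is generally described as sampling one snippet from each of several equal-length video segments and then aggregating the per-snippet predictions. The following is a minimal sketch of that idea, not the authors' implementation; the function names, the fixed random seed, and the use of plain averaging as the consensus are assumptions made for illustration.

```python
import random


def sample_segment_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of `num_segments` equal-length segments.

    Hypothetical helper: illustrates sparse, segment-based sampling only.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random(0)  # fixed seed is an assumption, for repeatability
    indices = []
    for k in range(num_segments):
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments  # exclusive bound
        indices.append(rng.randrange(start, end))
    return indices


def average_consensus(snippet_scores):
    """Average per-snippet class scores into one video-level score vector."""
    num_classes = len(snippet_scores[0])
    count = len(snippet_scores)
    return [sum(s[c] for s in snippet_scores) / count for c in range(num_classes)]
```

Because the number of sampled snippets is fixed by `num_segments` rather than by the video length, the cost per video stays constant, which is why this kind of sampling suits longer videos.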
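Some good Zhihu links I came across earlier: Action Recognition-1, Action Recognition-2, Temporal Action Detection-1, Temporal Action Detection-2, Temporal Action Detection-3, Temporal Action Detection-4
All slide images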

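2.2 Deep and Efficient Analysis and Understanding of Complex Videos (Yu Qiao, CAS)

An introduction to some empirical deep learning tricks
The open-set nature of face recognition (open-set recognition is somewhat similar to novelty detection; reference: TODO)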
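- Center Loss (ECCV 2016)

  Center loss means: give each class a class center and minimize the distance between every sample in the mini-batch and the center of its class, which shrinks the intra-class distance.

  The principle: on top of softmax loss, maintain one class center in feature space for each class of the training set. During training, add a constraint on the distance between each sample's feature (after the network mapping) and its class center, so that both intra-class compactness and inter-class separation are taken into account.

  From: link-1, link-2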
- An improvement of Center Loss (IJCV 2019): replace class centers with projection directions
- Losses designed around the large-margin idea:
  - L-softmax (Liu, ICML 2016)
  - A-softmax (Liu, CVPR 2017)
  - Additive Margin Softmax (ICLR 2018 workshop)
  - CosFace, large margin cosine loss (Wang, CVPR 2018)
  - ArcFace (CVPR 2019)
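The center loss described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the reference implementation: it computes half the squared distance between each feature and its class center, and moves each center toward the mini-batch samples of its class. The function names, the averaging over the mini-batch, and the default `alpha` are assumptions; in training the term would be added to the softmax loss with a weighting factor.

```python
def center_loss(features, labels, centers):
    """Mean over the mini-batch of 0.5 * ||x_i - c_{y_i}||^2.

    Averaging (rather than summing) over the batch is an assumption here.
    """
    total = 0.0
    for x, y in zip(features, labels):
        c = centers[y]
        total += 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total / len(features)


def update_centers(features, labels, centers, alpha=0.5):
    """Move each class center toward its mini-batch samples.

    Uses delta_j = sum(c_j - x_i) / (1 + n_j) and c_j <- c_j - alpha * delta_j,
    where n_j is the number of batch samples of class j.
    """
    new_centers = [list(c) for c in centers]
    for j, c in enumerate(centers):
        members = [x for x, y in zip(features, labels) if y == j]
        if not members:
            continue  # classes absent from the batch keep their center
        for d in range(len(c)):
            delta = sum(c[d] - x[d] for x in members) / (1 + len(members))
            new_centers[j][d] = c[d] - alpha * delta
    return new_centers
```

Updating the centers per mini-batch, instead of recomputing them over the whole training set, is what keeps the extra cost small; `alpha` damps the update so that a few mislabeled samples do not drag a center far.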
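Range Loss: effectively handles the long-tail problem caused by imbalanced sample counts across classes
- motivation: a few people (celebrities) have a large number of images, while most people have only a few. This long-tail distribution motivates two things: (1) a theoretical analysis of how a long-tail distribution affects model performance; (2) designing a new loss to address the problem
- there is a figure here, and the slides for Range Loss are missing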
video action recognition
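- pose attention mechanism: RPAN (ICCV 2017, Oral)
  - combines the two tasks of action recognition and pose estimation
  - uses pose changes to guide an RNN in modeling the dynamics of actions
One paper
- Temporal Hallucinating for Action Recognition with Few Still Images (CVPR 2018)
Some figures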

2.3 Understanding Emotions in Videos (Yanwei Fu, FDU)

Personal impression: this is a freshly opened research direction; interesting and worth learning about

applications
- web video search
- video recommendation system
- avoid inappropriate advertisement
Tasks of Emotions in videos
- Emotion recognition
- emotion attribution
- emotion-oriented summarization
Challenges
- Sparsely expressed in videos
- Diverse content and variable quality
Knowledge Transfer
- Zero-shot Emotion learning (with one figure)
  - A multi-task neural approach for emotion attribution, classification and summarization (TMM)
  - Frame-Transformer emotion classification Network (ICMR 2017)
Emotion-oriented summarization
- essentially keyframe selection plus fusion of frame information
Face emotion
- Posture, Expression, Identity in faces
Some figures:

2.4 Exploring Structured Deep Learning Methods for Human-Centered Visual Recognition and Localization (Wanli Ouyang, University of Sydney)

outline
- introduction
- structured feature learning
- back-bone model design
- conclusion
introduction
- object detection
- human pose estimation
- action recognition
structured feature learning
- structure in neurons
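  - motivation: in conventional networks, neurons within the same layer are not connected, while adjacent layers are locally or fully connected, so nothing guarantees that the information of local regions is preserved. This motivates giving the neurons in each layer structured information. Taking human pose estimation as an example, the talk analyzed the problems of fully connected modeling: modeling the distance between body joints requires large convolution kernels, and the relationships between some joints are unstable. A structured feature learning model for human pose estimation (Bidirectional Tree) is proposed.
  - Bidirectional Tree
  - corresponding papers
    - end-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation (CVPR 2016)
    - structured feature learning for pose estimation (CVPR 2016)
    - CRF-CNN, modeling structured information in human pose estimation (NIPS 2016)
    - learning deep structured multi-scale features using attention-gated CRFs for contour prediction (NIPS 2017)
  - application of structured feature learning
    - one figure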
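The bidirectional tree idea above, where joint features exchange information along the body tree in both directions, can be illustrated with a toy message-passing routine. This is only a sketch under strong simplifying assumptions: a single scalar `weight` stands in for whatever learned transformation the actual model applies between joints, joint 0 is assumed to be the root, and joints are assumed to be ordered so that every parent index is smaller than its child's. All names are hypothetical.

```python
def bidirectional_tree_pass(features, parent, weight=0.5):
    """Return (upward, downward) joint features after one pass per direction.

    features: list of per-joint feature vectors (lists of floats)
    parent:   parent[i] is the parent joint of i, with parent[0] == -1 (root);
              assumes parent[i] < i for every non-root joint
    weight:   scalar stand-in for a learned transformation (an assumption)
    """
    n = len(features)

    # Upward pass (leaves to root): each joint adds weighted messages
    # from its children, whose own features are already final.
    up = [list(f) for f in features]
    for i in range(n - 1, 0, -1):
        p = parent[i]
        up[p] = [a + weight * b for a, b in zip(up[p], up[i])]

    # Downward pass (root to leaves): each joint adds a weighted message
    # from its parent, whose own feature is already final.
    down = [list(f) for f in features]
    for i in range(1, n):
        p = parent[i]
        down[i] = [a + weight * b for a, b in zip(down[i], down[p])]

    return up, down
```

Passing messages only between neighboring joints is what avoids the large kernels mentioned in the motivation: distant joints influence each other through a chain of short-range steps rather than one long-range connection.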
back-bone model design
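- Hourglass for classification (encoder-decoder structures such as U-Net are generally used for image segmentation, not classification)
  - hope: features with high-level semantics and high resolution
  - reality: features with high-level semantics come with low resolution
  - why Hourglass performs poorly for classification: different tasks require features at different resolutions, hence FishNet is proposed
- FishNet
  - motivation: to unify the advantages of pixel-level, region-level, and image-level tasks, Wanli Ouyang proposed FishNet. Its advantages: gradients propagate better to shallow layers, and the extracted features contain rich low-level and high-level semantics while the information at each level is preserved and refined.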
  - pros.
    - better gradient flow to shallow layers
    - features:
      - contain rich low-level and high-level semantics
      - are preserved and refined from each other (information is exchanged across levels)
  - code: https://github.com/kevin-ssy/FishNet
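One way to picture "high-level semantics and high resolution" together is to upsample a deep, low-resolution feature map and concatenate it with a shallow, high-resolution one, so that neither is overwritten. The sketch below shows only that generic fusion step on nested lists; it is not taken from the FishNet repository linked above, and the function names and the choice of nearest-neighbor upsampling are assumptions.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbor upsampling of a 2D map (list of rows) by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]  # repeat each column
        out.extend([list(wide) for _ in range(factor)])  # repeat each row
    return out


def concat_channels(*channel_groups):
    """Concatenate groups of same-sized 2D maps along the channel axis."""
    fused = []
    for group in channel_groups:
        fused.extend(group)
    sizes = {(len(m), len(m[0])) for m in fused}
    if len(sizes) != 1:
        raise ValueError("all maps must share one spatial size")
    return fused


# Hypothetical toy data: one shallow 4x4 channel and one deep 2x2 channel.
shallow = [[[0, 0, 0, 0] for _ in range(4)]]
deep = [[[1, 2], [3, 4]]]
fused = concat_channels(shallow, [upsample_nearest(m, 2) for m in deep])
```

Concatenation keeps the shallow channels intact next to the upsampled deep ones, which matches the "preserved" part of the notes; any "refinement" would be done by later layers operating on `fused`.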
conclusion
- structured deep learning is (1) effective and (2) derived from observation
- end-to-end joint training bridges the gap between structure modeling and feature learning
Some figures

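2.5 Action Recognition and Understanding for Surveillance Video (Xiaowei Lin, SJTU)

Tasks in the action recognition field
- trajectory-based behavior analysis
- action recognition for arbitrary videos (Limin Wang)
- action recognition for surveillance video
A few points on object detection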
- Tiny DSOD (BMVC 2018)
- Toward accurate one-stage object detection with AP-Loss (CVPR 2019)
- kill two birds with one stone: boosting both object detection accuracy and speed with adaptive patch-of-interest composition (2017)
Several applications
- 3D object detection and pose estimation
- multi-object tracking
- real-time scene statistics based on object detection and tracking
- multi-camera tracking
  - (correspondence structure Re-ID)
    - learning correspondence structure for person re-identification (TIP 2017)
    - person re-identification with correspondence structure learning (ICCV 2015)
  - (Group Re-ID)
    - Group re-identification: leveraging and integrating multi-grain information (MM 2018)
  - in-vehicle cross-camera localization
  - unmanned supermarkets
  - wild Amur tiger Re-ID
Action recognition
- multi-scale features
  - action recognition with coarse-to-fine deep feature integration and asynchronous fusion (AAAI 2018)
- asynchronous spatio-temporal association
  - cross-stream selective networks for action recognition (CVPR workshop 2019)
- spatio-temporal action localization
  - finding action tubes with an integrated sparse-to-dense framework (arXiv 2019)
- action recognition in surveillance
  - one figure
Other applications
- real-time action/event detection
- trajectory-based behavior analysis: individual behavior analysis
  - a tube-and-droplet-based approach for representing and analyzing motion trajectories (TPAMI 2017); not deep-learning based, not to my taste
- trajectory-based behavior analysis: trajectory clustering and mining
  - unsupervised trajectory clustering via adaptive multi-kernel-based shrinkage (ICCV 2015); fairly old, but it can serve as a base: just read the latest high-quality papers that cite it
- crowded-scene behavior analysis
  - a diffusion and clustering-based approach for finding coherent motions and understanding crowd scenes (TIP 2016)
  - finding coherent motions and semantic regions in crowd scenes: a diffusion and clustering approach (ECCV 2014)
Homepage: link-1

Reposted from: https://www.cnblogs.com/LS1314/p/10885080.html
