今日arXiv精选 | 14 Latest ICCV 2021 Papers

About #今日arXiv精选

This is a column from 「AI 学术前沿」 (AI Academic Frontiers): each day, the editors select high-quality papers from arXiv and deliver them to readers.

LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision

Comment: ICCV 2021. Project page: https://loctex.mit.edu/

Link: http://arxiv.org/abs/2108.11950

Abstract

Computer vision tasks such as object detection and semantic/instance segmentation rely on the painstaking annotation of large training datasets. In this paper, we propose LocTex that takes advantage of the low-cost localized textual annotations (i.e., captions and synchronized mouse-over gestures) to reduce the annotation effort. We introduce a contrastive pre-training framework between images and captions and propose to supervise the cross-modal attention map with rendered mouse traces to provide coarse localization signals. Our learned visual features capture rich semantics (from free-form captions) and accurate localization (from mouse traces), which are very effective when transferred to various downstream vision tasks. Compared with ImageNet supervised pre-training, LocTex can reduce the size of the pre-training dataset by 10x or the target dataset by 2x while achieving comparable or even improved performance on COCO instance segmentation. When provided with the same amount of annotations, LocTex achieves around 4% higher accuracy than the previous state-of-the-art "vision+language" pre-training approach on the task of PASCAL VOC image classification.
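
To make the two training signals concrete, here is a minimal sketch, assuming PyTorch and illustrative tensor shapes (none of this is the authors' released code): a symmetric InfoNCE loss between paired image and caption embeddings, plus a KL term that pushes the cross-modal attention map toward the rendered mouse-trace heatmap.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/caption embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(len(logits))                 # positives on diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def localization_loss(attn_map, trace_map, eps=1e-8):
    """Match the cross-modal attention (B, H, W) to the rendered mouse-trace
    heatmap via KL divergence over spatial locations."""
    b = attn_map.size(0)
    p = attn_map.view(b, -1)
    q = trace_map.view(b, -1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)
    q = q / (q.sum(dim=1, keepdim=True) + eps)
    return F.kl_div((p + eps).log(), q, reduction="batchmean")

# total = contrastive_loss(f_img, f_txt) + lam * localization_loss(attn, traces)
```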

Probabilistic Modeling for Human Mesh Recovery

Comment: ICCV 2021. Project page: https://www.seas.upenn.edu/~nkolot/projects/prohmr

Link: http://arxiv.org/abs/2108.11944

Abstract

This paper focuses on the problem of 3D human reconstruction from 2D evidence. Although this is an inherently ambiguous problem, the majority of recent works avoid the uncertainty modeling and typically regress a single estimate for a given input. In contrast to that, in this work, we propose to embrace the reconstruction ambiguity and we recast the problem as learning a mapping from the input to a distribution of plausible 3D poses. Our approach is based on the normalizing flows model and offers a series of advantages. For conventional applications, where a single 3D estimate is required, our formulation allows for efficient mode computation. Using the mode leads to performance that is comparable with the state of the art among deterministic unimodal regression models. Simultaneously, since we have access to the likelihood of each sample, we demonstrate that our model is useful in a series of downstream tasks, where we leverage the probabilistic nature of the prediction as a tool for more accurate estimation. These tasks include reconstruction from multiple uncalibrated views, as well as human model fitting, where our model acts as a powerful image-based prior for mesh recovery. Our results validate the importance of probabilistic modeling, and indicate state-of-the-art performance across a variety of settings. Code and models are available at: https://www.seas.upenn.edu/~nkolot/projects/prohmr.
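
A minimal sketch of the probabilistic formulation, assuming PyTorch; the toy single affine layer below stands in for ProHMR's stacked conditional coupling layers and only illustrates the interface: sampling many plausible poses, and reading off a fast deterministic estimate as the flow's image of the base density's mode (z = 0).

```python
import torch

class AffineFlow(torch.nn.Module):
    """Toy conditional flow: pose = mu(c) + sigma(c) * z, conditioned on image
    features c. ProHMR itself stacks conditional coupling layers."""
    def __init__(self, feat_dim=2048, pose_dim=144):
        super().__init__()
        self.mu = torch.nn.Linear(feat_dim, pose_dim)
        self.log_sigma = torch.nn.Linear(feat_dim, pose_dim)

    def forward(self, z, c):
        return self.mu(c) + self.log_sigma(c).exp() * z

flow = AffineFlow()
c = torch.randn(1, 2048)                     # image features from a backbone

# Draw many plausible poses for downstream tasks (multi-view fusion, fitting).
samples = flow(torch.randn(100, 144), c.expand(100, -1))

# Deterministic single estimate: the Gaussian base peaks at z = 0, so the
# image of 0 under the flow is a cheap-to-compute conditional mode.
mode = flow(torch.zeros(1, 144), c)
```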

Semantically Coherent Out-of-Distribution Detection

Comment: 15 pages, 7 figures. Accepted by ICCV-2021. Project page: https://jingkang50.github.io/projects/scood

Link: http://arxiv.org/abs/2108.11941

Abstract

Current out-of-distribution (OOD) detection benchmarks are commonly built by defining one dataset as in-distribution (ID) and all others as OOD. However, these benchmarks unfortunately introduce some unwanted and impractical goals, e.g., to perfectly distinguish CIFAR dogs from ImageNet dogs, even though they have the same semantics and negligible covariate shifts. These unrealistic goals will result in an extremely narrow range of model capabilities, greatly limiting their use in real applications. To overcome these drawbacks, we re-design the benchmarks and propose the semantically coherent out-of-distribution detection (SC-OOD). On the SC-OOD benchmarks, existing methods suffer from large performance degradation, suggesting that they are extremely sensitive to low-level discrepancy between data sources while ignoring their inherent semantics. To develop an effective SC-OOD detection approach, we leverage an external unlabeled set and design a concise framework featured by unsupervised dual grouping (UDG) for the joint modeling of ID and OOD data. The proposed UDG can not only enrich the semantic knowledge of the model by exploiting unlabeled data in an unsupervised manner, but also distinguish ID/OOD samples to enhance ID classification and OOD detection tasks simultaneously. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on SC-OOD benchmarks. Code and benchmarks are provided on our project page: https://jingkang50.github.io/projects/scood.
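
The UDG details are in the paper; as a rough illustration of the "dual grouping" idea only, the hypothetical sketch below clusters labeled ID features together with unlabeled features and treats unlabeled samples that land in ID-dominated clusters as pseudo-ID (the threshold and cluster count are made up).

```python
import numpy as np
from sklearn.cluster import KMeans

def dual_grouping(id_feats, unlabeled_feats, k=50, id_ratio_thresh=0.5):
    feats = np.concatenate([id_feats, unlabeled_feats])
    is_id = np.array([True] * len(id_feats) + [False] * len(unlabeled_feats))
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(feats)

    pseudo_id = np.zeros(len(unlabeled_feats), dtype=bool)
    for c in range(k):
        members = clusters == c
        ratio = is_id[members].mean()        # fraction of labeled ID samples
        if ratio > id_ratio_thresh:
            pseudo_id |= (clusters[len(id_feats):] == c)
    return pseudo_id   # True: use for ID classification; False: treat as OOD
```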

Mining Contextual Information Beyond Image for Semantic Segmentation

Comment: Accepted by ICCV2021

Link: http://arxiv.org/abs/2108.11819

Abstract

This paper studies the context aggregation problem in semantic image segmentation. Existing research focuses on improving the pixel representations by aggregating the contextual information within individual images. Though impressive, these methods neglect the significance of the representations of the pixels of the corresponding class beyond the input image. To address this, this paper proposes to mine the contextual information beyond individual images to further augment the pixel representations. We first set up a feature memory module, which is updated dynamically during training, to store the dataset-level representations of various categories. Then, we learn the class probability distribution of each pixel representation under the supervision of the ground-truth segmentation. At last, the representation of each pixel is augmented by aggregating the dataset-level representations based on the corresponding class probability distribution. Furthermore, by utilizing the stored dataset-level representations, we also propose a representation consistent learning strategy to make the classification head better address intra-class compactness and inter-class dispersion. The proposed method can be effortlessly incorporated into existing segmentation frameworks (e.g., FCN, PSPNet, OCRNet and DeepLabV3) and brings consistent performance improvements. Mining contextual information beyond the image allows us to report state-of-the-art performance on various benchmarks: ADE20K, LIP, Cityscapes and COCO-Stuff.
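
A minimal sketch of the two core operations described above, with hypothetical shapes and update rule: a momentum-updated memory of dataset-level class prototypes, and per-pixel augmentation by aggregating those prototypes under the predicted class distribution.

```python
import torch
import torch.nn.functional as F

def update_memory(memory, feats, labels, momentum=0.9):
    """memory: (C, D) dataset-level class prototypes; feats: (N, D) pixel
    features from the current batch; labels: (N,) ground-truth classes."""
    for c in labels.unique():
        mean_c = feats[labels == c].mean(dim=0)
        memory[c] = momentum * memory[c] + (1 - momentum) * mean_c
    return memory

def augment_pixels(feats, class_logits, memory):
    """Mix in dataset-level context weighted by each pixel's predicted class
    probability distribution, then concatenate with the original feature."""
    probs = F.softmax(class_logits, dim=-1)      # (N, C)
    context = probs @ memory                     # (N, D) weighted prototypes
    return torch.cat([feats, context], dim=-1)   # fed to the seg head

memory = torch.zeros(21, 256)                    # e.g. 21 classes, 256-d feats
memory = update_memory(memory, torch.randn(64, 256), torch.randint(0, 21, (64,)))
```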

A Hierarchical Assessment of Adversarial Severity

Comment: To appear in the ICCV 2021 Workshop on Adversarial Robustness in the Real World

Link: http://arxiv.org/abs/2108.11785

Abstract

Adversarial robustness is a growing field that evidences the brittleness of neural networks. Although the literature on adversarial robustness is vast, a dimension is missing in these studies: assessing how severe the mistakes are. We call this notion "Adversarial Severity", since it quantifies the downstream impact of adversarial corruptions by computing the semantic error between the misclassification and the proper label. We propose to study the effects of adversarial noise by measuring the Robustness and Severity on a large-scale dataset: iNaturalist-H. Our contributions are: (i) we introduce novel Hierarchical Attacks that harness the rich structured space of labels to create adversarial examples; (ii) these attacks allow us to benchmark the Adversarial Robustness and Severity of classification models; (iii) we enhance the traditional adversarial training with a simple yet effective Hierarchical Curriculum Training to learn these nodes gradually within the hierarchical tree. We perform extensive experiments showing that hierarchical defenses allow deep models to boost the adversarial Robustness by 1.85% and reduce the severity of all attacks by 0.17, on average.
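
To make "severity" concrete: one natural instantiation (a toy taxonomy below, not the iNaturalist-H hierarchy) scores a mistake by how far up the label tree one must climb from the true label to reach an ancestor of the prediction.

```python
parents = {            # toy hierarchy: node -> parent
    "husky": "dog", "beagle": "dog", "tabby": "cat",
    "dog": "mammal", "cat": "mammal", "mammal": "animal", "animal": None,
}

def ancestors(label):
    path = [label]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path

def severity(pred, true):
    """Hops from the true label up to the lowest common ancestor of the pair;
    0 means correct, larger means a semantically worse mistake."""
    pred_anc = set(ancestors(pred))
    for hops, node in enumerate(ancestors(true)):
        if node in pred_anc:
            return hops
    return len(ancestors(true))

print(severity("husky", "husky"))   # 0: correct
print(severity("beagle", "husky"))  # 1: confusion within 'dog'
print(severity("tabby", "husky"))   # 2: crossed over to 'cat' under 'mammal'
```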

Cross-category Video Highlight Detection via Set-based Learning

Comment: Accepted as poster presentation at the International Conference on Computer Vision (ICCV), 2021

Link: http://arxiv.org/abs/2108.11770

Abstract

Autonomous highlight detection is crucial for enhancing the efficiency of video browsing on social media platforms. To attain this goal in a data-driven way, one may often face the situation where highlight annotations are not available on the target video category used in practice, while supervision on another video category (named the source video category) is achievable. In such a situation, one can derive an effective highlight detector on the target video category by transferring the highlight knowledge acquired from the source video category to the target one. We call this problem cross-category video highlight detection, which has been rarely studied in previous works. For tackling such a practical problem, we propose a Dual-Learner-based Video Highlight Detection (DL-VHD) framework. Under this framework, we first design a Set-based Learning module (SL-module) to improve the conventional pair-based learning by assessing the highlight extent of a video segment under a broader context. Based on such a learning manner, we introduce two different learners to acquire the basic distinction of target category videos and the characteristics of highlight moments on the source video category, respectively. These two types of highlight knowledge are further consolidated via knowledge distillation. Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed SL-module, and the DL-VHD method outperforms five typical Unsupervised Domain Adaptation (UDA) algorithms on various cross-category highlight detection tasks. Our code is available at https://github.com/ChrisAllenMing/Cross_Category_Video_Highlight .
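
The consolidation step is standard knowledge distillation; here is a minimal sketch, assuming PyTorch and hypothetical score shapes, in which the target-category learner matches the softened highlight-score distribution of the source-trained learner over the segments of each video (the "set-based" context).

```python
import torch
import torch.nn.functional as F

def distill_loss(student_scores, teacher_scores, T=2.0):
    """KL between temperature-softened highlight-score distributions over the
    set of segments of one video."""
    p_t = F.softmax(teacher_scores / T, dim=-1)
    log_p_s = F.log_softmax(student_scores / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

# scores: (videos_in_batch, segments_per_video) highlight logits per segment
student = torch.randn(4, 16, requires_grad=True)
teacher = torch.randn(4, 16)
loss = distill_loss(student, teacher.detach())
loss.backward()
```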

Spatio-Temporal Dynamic Inference Network for Group Activity Recognition

Comment: Accepted to ICCV2021

Link: http://arxiv.org/abs/2108.11743

Abstract

Group activity recognition aims to understand the activity performed by a group of people. In order to solve it, modeling complex spatio-temporal interactions is the key. Previous methods are limited to reasoning on a predefined graph, which ignores the inherent person-specific interaction context. Moreover, they adopt inference schemes that are computationally expensive and easily result in the over-smoothing problem. In this paper, we manage to achieve spatio-temporal person-specific inference by proposing the Dynamic Inference Network (DIN), which is composed of a Dynamic Relation (DR) module and a Dynamic Walk (DW) module. We first propose to initialize interaction fields on a primary spatio-temporal graph. Within each interaction field, we apply DR to predict the relation matrix and DW to predict the dynamic walk offsets in a joint-processing manner, thus forming a person-specific interaction graph. By updating features on the specific graph, a person can possess a global-level interaction field with a local initialization. Experiments indicate both modules' effectiveness. Moreover, DIN achieves significant improvement over previous state-of-the-art methods on two popular datasets under the same setting, while incurring much less computation overhead in the reasoning module.
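
As a rough illustration of the Dynamic Relation idea (hypothetical shapes, and omitting the Dynamic Walk offsets), each person predicts its own softmax weights over a local interaction field and aggregates features accordingly, rather than reasoning on one fixed graph.

```python
import torch
import torch.nn.functional as F

class DynamicRelation(torch.nn.Module):
    def __init__(self, dim=256, field=9):
        super().__init__()
        self.rel = torch.nn.Linear(dim, field)    # per-person relation logits

    def forward(self, x):
        """x: (N, field, dim) features of each person's local interaction
        field (self + neighbours in the primary spatio-temporal graph)."""
        center = x[:, 0]                          # each person's own feature
        w = F.softmax(self.rel(center), dim=-1)   # (N, field) dynamic weights
        return torch.einsum("nf,nfd->nd", w, x)   # person-specific aggregation

x = torch.randn(12, 9, 256)                       # 12 people, field of 9
print(DynamicRelation()(x).shape)                 # torch.Size([12, 256])
```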

Learning to Diversify for Single Domain Generalization

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.11726

Abstract

Domain generalization (DG) aims to generalize a model trained on multiple source (i.e., training) domains to a distributionally different target (i.e., test) domain. In contrast to conventional DG, which strictly requires the availability of multiple source domains, this paper considers a more realistic yet challenging scenario, namely Single Domain Generalization (Single-DG), where only one source domain is available for training. In this scenario, the limited diversity may jeopardize the model's generalization to unseen target domains. To tackle this problem, we propose a style-complement module to enhance the generalization power of the model by synthesizing images from diverse distributions that are complementary to the source ones. More specifically, we adopt a tractable upper bound of the mutual information (MI) between the generated and source samples and perform a two-step optimization iteratively: (1) by minimizing the MI upper bound approximation for each sample pair, the generated images are forced to diversify from the source samples; (2) subsequently, we maximize the MI between samples from the same semantic category, which assists the network in learning discriminative features from diverse-styled images. Extensive experiments on three benchmark datasets demonstrate the superiority of our approach, which surpasses the state-of-the-art single-DG methods by up to 25.14%.
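
A minimal sketch of the two-step objective with stand-in surrogate losses (not the paper's exact MI bound): step 1 pushes generated features away from their paired source features; step 2 pulls same-class samples together with a supervised-contrastive term.

```python
import torch
import torch.nn.functional as F

def diversify_step(f_src, f_gen):
    """Step 1: train the style-complement generator so generated features move
    away from paired source features (stand-in for minimizing an MI bound)."""
    return F.cosine_similarity(f_src, f_gen, dim=-1).mean()   # minimize this

def align_step(feats, labels, temperature=0.1):
    """Step 2: supervised-contrastive pull between same-class samples
    (stand-in for maximizing MI within a semantic category)."""
    f = F.normalize(feats, dim=-1)
    sim = f @ f.t() / temperature
    same = (labels[:, None] == labels[None, :]).float()
    same.fill_diagonal_(0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(same * log_prob).sum(1).div(same.sum(1).clamp(min=1)).mean()

feats, labels = torch.randn(8, 128), torch.randint(0, 3, (8,))
print(align_step(feats, labels))
```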

A Robust Loss for Point Cloud Registration

Comment: Accepted to ICCV 2021

Link: http://arxiv.org/abs/2108.11682

Abstract

The performance of surface registration relies heavily on the metric used for the alignment error between the source and target shapes. Traditionally, such a metric is based on the point-to-point or point-to-plane distance from the points on the source surface to their closest points on the target surface, which is susceptible to failure due to the instability of the closest-point correspondence. In this paper, we propose a novel metric based on the intersection points between the two shapes and a random straight line, which does not assume a specific correspondence. We verify the effectiveness of this metric by extensive experiments, including its direct optimization for a single registration problem as well as unsupervised learning for a set of registration problems. The results demonstrate that the algorithms utilizing our proposed metric outperform the state-of-the-art optimization-based and unsupervised learning-based methods.
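
As a very rough, hypothetical point-cloud approximation of this idea (the paper works with surfaces and exact intersections): cast random lines, take each cloud's points near a line as surrogate intersection points, and compare where they fall along the line, with no closest-point correspondence involved.

```python
import numpy as np

def line_metric(src, tgt, n_lines=64, radius=0.05, rng=np.random):
    """src, tgt: (N, 3) point clouds. Returns a correspondence-free
    alignment error averaged over random lines."""
    err = []
    for _ in range(n_lines):
        o = rng.randn(3)
        d = rng.randn(3); d /= np.linalg.norm(d)   # random line o + t*d
        def hits(points):
            v = points - o
            t = v @ d                               # position along the line
            dist = np.linalg.norm(v - np.outer(t, d), axis=1)
            return np.sort(t[dist < radius])        # surrogate intersections
        hs, ht = hits(src), hits(tgt)
        if len(hs) and len(ht):
            k = min(len(hs), len(ht))
            err.append(np.mean(np.abs(hs[:k] - ht[:k])))
    return np.mean(err) if err else np.inf
```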

SketchLattice: Latticed Representation for Sketch Manipulation

Comment: accepted to ICCV 2021

Link: http://arxiv.org/abs/2108.11636

Abstract

The key challenge in designing a sketch representation lies in handling the abstract and iconic nature of sketches. Existing work predominantly utilizes either (i) a pixelative format that treats sketches as natural images, employing off-the-shelf CNN-based networks, or (ii) an elaborately designed vector format that leverages the structural information of drawing orders using sequential RNN-based methods. While the pixelative format lacks intuitive exploitation of structural cues, sketches in vector format are absent in most cases, limiting their practical usage. Hence, in this paper, we propose a lattice-structured sketch representation that not only removes the bottleneck of requiring vector data but also preserves the structural cues that vector data provides. Essentially, a sketch lattice is a set of points sampled from the pixelative format of the sketch using a lattice graph. We show that our lattice structure is particularly amenable to structural changes that largely benefit sketch abstraction modeling for generation tasks. Our lattice representation can be effectively encoded using a graph model that uses significantly fewer model parameters (13.5 times fewer) than the existing state of the art. Extensive experiments demonstrate the effectiveness of the sketch lattice for sketch manipulation, including sketch healing and image-to-sketch synthesis.
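
A minimal sketch of the lattice sampling step (grid size is hypothetical): overlay a regular grid on the rasterized sketch and keep the grid points that land on stroke pixels; these points become the nodes the graph encoder consumes.

```python
import numpy as np

def sketch_to_lattice(binary_sketch, grid=32):
    """binary_sketch: (H, W) boolean array, True on stroke pixels."""
    h, w = binary_sketch.shape
    ys = np.linspace(0, h - 1, grid).astype(int)
    xs = np.linspace(0, w - 1, grid).astype(int)
    points = [(y, x) for y in ys for x in xs if binary_sketch[y, x]]
    return np.array(points)            # nodes of the lattice graph

canvas = np.zeros((256, 256), dtype=bool)
canvas[120:136, 40:220] = True         # a thick horizontal stroke
print(sketch_to_lattice(canvas).shape) # grid nodes lying on the stroke
```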

Few-shot Visual Relationship Co-localization

Comment: Accepted in ICCV 2021

Link: http://arxiv.org/abs/2108.11618

Abstract

In this paper, given a small bag of images, each containing a common but latent predicate, we are interested in localizing visual subject-object pairs connected via the common predicate in each of the images. We refer to this novel problem as visual relationship co-localization, or VRC as an abbreviation. VRC is a challenging task, even more so than the well-studied object co-localization task. It becomes further challenging because, using just a few images, the model has to learn to co-localize visual subject-object pairs connected via unseen predicates. To solve VRC, we propose an optimization framework to select a common visual relationship in each image of the bag. The goal of the optimization framework is to find the optimal solution by learning visual relationship similarity across images in a few-shot setting. To obtain robust visual relationship representations, we utilize a simple yet effective technique that learns a relationship embedding as a translation vector from the visual subject to the visual object in a shared space. Further, to learn visual relationship similarity, we utilize a proven meta-learning technique commonly used for few-shot classification tasks. Finally, to tackle the combinatorial complexity challenge arising from an exponential number of feasible solutions, we use a greedy approximation inference algorithm that selects approximately the best solution. We extensively evaluate our proposed framework on variations of bag sizes obtained from two challenging public datasets, namely VrR-VG and VG-150, and achieve impressive visual co-localization performance.
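
The translation-vector trick is easy to state in code; a minimal sketch with illustrative shapes (not the paper's implementation): project subject and object features into a shared space and represent the predicate as their difference, TransE-style, so two subject-object pairs can be compared by their translation vectors.

```python
import torch
import torch.nn.functional as F

def relation_embedding(subj_feat, obj_feat, proj):
    """TransE-style: predicate ≈ object - subject in the shared space."""
    return proj(obj_feat) - proj(subj_feat)

proj = torch.nn.Linear(2048, 256)          # shared projection head
s1, o1 = torch.randn(1, 2048), torch.randn(1, 2048)
s2, o2 = torch.randn(1, 2048), torch.randn(1, 2048)
r1 = relation_embedding(s1, o1, proj)
r2 = relation_embedding(s2, o2, proj)
# High similarity suggests the two pairs share the latent common predicate.
print(F.cosine_similarity(r1, r2).item())
```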

Unsupervised Dense Deformation Embedding Network for Template-Free Shape Correspondence

Comment: 15 pages, 18 figures. Accepted to ICCV 2021

Link: http://arxiv.org/abs/2108.11609

Abstract

Shape correspondence from 3D deformation learning has recently attracted considerable academic interest. Nevertheless, current deep-learning-based methods require the supervision of dense annotations to learn per-point translations, which severely over-parameterizes the deformation process. Moreover, they fail to capture the local geometric details of the original shape via global feature embedding. To address these challenges, we develop a new Unsupervised Dense Deformation Embedding Network (i.e., UD^2E-Net), which learns to predict deformations between non-rigid shapes from dense local features. Since it is non-trivial to match deformation-variant local features for deformation prediction, we develop an Extrinsic-Intrinsic Autoencoder to first encode extrinsic geometric features from the source into intrinsic coordinates in a shared canonical shape, with which the decoder then synthesizes corresponding target features. Moreover, a bounded maximum mean discrepancy loss is developed to mitigate the distribution divergence between the synthesized and original features. To learn natural deformation without dense supervision, we introduce a coarse parameterized deformation graph, for which a novel trace-and-propagation algorithm is proposed to improve both the quality and efficiency of the deformation. Our UD^2E-Net outperforms state-of-the-art unsupervised methods by 24% on the Faust Inter challenge and even supervised methods by 13% on the Faust Intra challenge.
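
The bounding modification is the paper's contribution; below is a minimal sketch of only the underlying discrepancy term, a standard Gaussian-kernel MMD between synthesized and original feature sets.

```python
import torch

def mmd(x, y, sigma=1.0):
    """Maximum mean discrepancy between feature sets x: (N, D), y: (M, D)."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

synthesized = torch.randn(128, 64, requires_grad=True)
original = torch.randn(128, 64)
loss = mmd(synthesized, original)   # drive synthesized features toward
loss.backward()                     # the original feature distribution
```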

The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.11550

Abstract

It is fundamental for personal robots to reliably navigate to a specified goal. To study this task, PointGoal navigation has been introduced in simulated Embodied AI environments. Recent advances solve this PointGoal navigation task with near-perfect accuracy (99.6% success) in photo-realistically simulated environments, assuming noiseless egocentric vision, noiseless actuation, and, most importantly, perfect localization. However, under realistic noise models for visual sensors and actuation, and without access to a "GPS and Compass sensor," the 99.6%-success agents for PointGoal navigation succeed in only 0.3% of episodes. In this work, we demonstrate the surprising effectiveness of visual odometry for the task of PointGoal navigation in this realistic setting, i.e., with realistic noise models for perception and actuation and without access to GPS and Compass sensors. We show that integrating visual odometry techniques into navigation policies improves the state of the art on the popular Habitat PointNav benchmark by a large margin, improving success from 64.5% to 71.7% while executing 6.4 times faster.
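
A 2D, hypothetical-interface sketch of why visual odometry can substitute for the GPS+Compass sensor: each step, a VO model predicts the agent's relative motion from two consecutive frames, and the goal coordinates are re-expressed in the new egocentric frame (dead reckoning).

```python
import numpy as np

def update_goal(goal_xy, vo_translation_xy, vo_rotation):
    """goal_xy: goal in the previous egocentric frame; vo_*: relative motion
    (dx, dy) and heading change (radians) predicted from two RGB-D frames."""
    shifted = goal_xy - vo_translation_xy
    c, s = np.cos(-vo_rotation), np.sin(-vo_rotation)
    return np.array([c * shifted[0] - s * shifted[1],
                     s * shifted[0] + c * shifted[1]])

goal = np.array([5.0, 0.0])                            # 5 m straight ahead
goal = update_goal(goal, np.array([0.25, 0.0]), 0.0)   # moved 0.25 m forward
print(goal)                                            # goal now 4.75 m ahead
```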

TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios

Comment: 8 pages, 9 figures, VisDrone 2021 ICCV workshop

Link: http://arxiv.org/abs/2108.11539

Abstract

Object detection in drone-captured scenarios is a recent popular task. As drones always navigate at different altitudes, the object scale varies violently, which burdens the optimization of networks. Moreover, high-speed and low-altitude flight brings motion blur on densely packed objects, which makes object distinction a great challenge. To solve the two issues mentioned above, we propose TPH-YOLOv5. Based on YOLOv5, we add one more prediction head to detect objects of different scales. Then we replace the original prediction heads with Transformer Prediction Heads (TPH) to explore the prediction potential with a self-attention mechanism. We also integrate the convolutional block attention module (CBAM) to find attention regions in scenarios with dense objects. To achieve further improvement of our proposed TPH-YOLOv5, we provide a bag of useful strategies such as data augmentation, multi-scale testing, multi-model integration and an extra classifier. Extensive experiments on the VisDrone2021 dataset show that TPH-YOLOv5 has good performance with impressive interpretability in drone-captured scenarios. On the DET-test-challenge dataset, the AP result of TPH-YOLOv5 is 39.18%, which is better than the previous SOTA method (DPNetV3) by 1.81%. In the VisDrone Challenge 2021, TPH-YOLOv5 wins 5th place and achieves well-matched results with the 1st-place model (AP 39.43%). Compared to the baseline model (YOLOv5), TPH-YOLOv5 improves by about 7%, which is encouraging and competitive.
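
For reference, a minimal CBAM block as described (channel attention followed by spatial attention); the hyperparameters are illustrative rather than the TPH-YOLOv5 configuration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):                         # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))        # channel attention from
        mx = self.mlp(x.amax(dim=(2, 3)))         # avg- and max-pooled stats
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        sp = torch.cat([x.mean(1, keepdim=True),
                        x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(sp))  # spatial attention

feats = torch.randn(2, 256, 20, 20)
print(CBAM(256)(feats).shape)                     # torch.Size([2, 256, 20, 20])
```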
