About #今日arXiv精选

This is a column from 「AI 学术前沿」. Each day, the editors hand-pick high-quality papers from arXiv and deliver them to readers.

TSI: an Ad Text Strength Indicator using Text-to-CTR and Semantic-Ad-Similarity

Comment: Accepted for publication at CIKM 2021

Link: http://arxiv.org/abs/2108.08226

Abstract

Coming up with effective ad text is a time-consuming process, and particularly challenging for small businesses with limited advertising experience. When an inexperienced advertiser onboards with a poorly written ad text, the ad platform has the opportunity to detect low performing ad text, and provide improvement suggestions. To realize this opportunity, we propose an ad text strength indicator (TSI) which: (i) predicts the click-through-rate (CTR) for an input ad text, (ii) fetches similar existing ads to create a neighborhood around the input ad, (iii) and compares the predicted CTRs in the neighborhood to declare whether the input ad is strong or weak. In addition, as suggestions for ad text improvement, TSI shows anonymized versions of superior ads (higher predicted CTR) in the neighborhood. For (i), we propose a BERT-based text-to-CTR model trained on impressions and clicks associated with an ad text. For (ii), we propose a sentence-BERT-based semantic-ad-similarity model trained using weak labels from ad campaign setup data. Offline experiments demonstrate that our BERT-based text-to-CTR model achieves a significant lift in CTR prediction AUC for cold start (new) advertisers compared to bag-of-words based baselines. In addition, our semantic-textual-similarity model for similar ads retrieval achieves a precision@1 of 0.93 (for retrieving ads from the same product category); this is significantly higher compared to unsupervised TF-IDF, word2vec, and sentence-BERT baselines. Finally, we share promising online results from advertisers in the Yahoo (Verizon Media) ad platform where a variant of TSI was implemented with sub-second end-to-end latency.
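
As a sketch of how the three steps fit together, the snippet below predicts a CTR for the input ad, retrieves a semantic neighborhood, and declares the ad weak if enough neighbors beat it. `predict_ctr`, `embed`, and the majority-vote rule are placeholder assumptions for illustration, not Yahoo's implementation.

```python
import numpy as np

def ad_text_strength(ad_text, corpus_texts, corpus_embs,
                     predict_ctr, embed, k=10, weak_frac=0.5):
    # (ii) semantic neighborhood via cosine similarity of sentence embeddings
    q = embed(ad_text)
    sims = corpus_embs @ q / (np.linalg.norm(corpus_embs, axis=1)
                              * np.linalg.norm(q) + 1e-9)
    neighbors = np.argsort(-sims)[:k]
    # (i) predicted CTR for the input ad and (iii) for its neighbors
    input_ctr = predict_ctr(ad_text)
    neighbor_ctrs = np.array([predict_ctr(corpus_texts[i]) for i in neighbors])
    better = neighbor_ctrs > input_ctr
    label = "weak" if better.mean() >= weak_frac else "strong"
    # improvement suggestions: the superior (higher predicted CTR) neighbors
    suggestions = [corpus_texts[i] for i, b in zip(neighbors, better) if b]
    return label, suggestions
```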

Learning Implicit User Profiles for Personalized Retrieval-Based Chatbot

Comment: Accepted by CIKM 2021

Code: https://github.com/qhjqhj00/CIKM2021-IMPChat

Link: http://arxiv.org/abs/2108.07935

Abstract

In this paper, we explore the problem of developing personalized chatbots. A personalized chatbot is designed as a digital chatting assistant for a user. The key characteristic of a personalized chatbot is that it should have a consistent personality with the corresponding user. It can talk the same way as the user when it is delegated to respond to others' messages. We present a retrieval-based personalized chatbot model, namely IMPChat, to learn an implicit user profile from the user's dialogue history. We argue that the implicit user profile is superior to the explicit user profile regarding accessibility and flexibility. IMPChat aims to learn an implicit user profile through modeling the user's personalized language style and personalized preferences separately. To learn a user's personalized language style, we elaborately build language models from shallow to deep using the user's historical responses; to model a user's personalized preferences, we explore the conditional relations underneath each post-response pair of the user. The personalized preferences are dynamic and context-aware: we assign higher weights to those historical pairs that are topically related to the current query when aggregating the personalized preferences. We match each response candidate with the personalized language style and personalized preference, respectively, and fuse the two matching signals to determine the final ranking score. Comprehensive experiments on two large datasets show that our method outperforms all baseline models.

Pixel-Perfect Structure-from-Motion with Featuremetric Refinement

Comment: Accepted to ICCV 2021 for oral presentation

Link: http://arxiv.org/abs/2108.08291

Abstract

Finding local features that are repeatable across multiple views is a cornerstone of sparse 3D reconstruction. The classical image matching paradigm detects keypoints per-image once and for all, which can yield poorly-localized features and propagate large errors to the final geometry. In this paper, we refine two key steps of structure-from-motion by a direct alignment of low-level image information from multiple views: we first adjust the initial keypoint locations prior to any geometric estimation, and subsequently refine points and camera poses as a post-processing. This refinement is robust to large detection noise and appearance changes, as it optimizes a featuremetric error based on dense features predicted by a neural network. This significantly improves the accuracy of camera poses and scene geometry for a wide range of keypoint detectors, challenging viewing conditions, and off-the-shelf deep features. Our system easily scales to large image collections, enabling pixel-perfect crowd-sourced localization at scale. Our code is publicly available at https://github.com/cvg/pixel-perfect-sfm as an add-on to the popular SfM software COLMAP.

Deep Reparametrization of Multi-Frame Super-Resolution and Denoising

Comment: ICCV 2021 Oral

Link: http://arxiv.org/abs/2108.08286

Abstract

We propose a deep reparametrization of the maximum a posteriori formulation commonly employed in multi-frame image restoration tasks. Our approach is derived by introducing a learned error metric and a latent representation of the target image, which transforms the MAP objective to a deep feature space. The deep reparametrization allows us to directly model the image formation process in the latent space, and to integrate learned image priors into the prediction. Our approach thereby leverages the advantages of deep learning, while also benefiting from the principled multi-frame fusion provided by the classical MAP formulation. We validate our approach through comprehensive experiments on burst denoising and burst super-resolution datasets. Our approach sets a new state-of-the-art for both tasks, demonstrating the generality and effectiveness of the proposed formulation.
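
The underlying idea can be written schematically; the notation below is chosen here for illustration and is not taken from the paper.

```latex
% Classical multi-frame MAP: recover the target y from burst frames x_i,
% with degradation operators A_i (warp, blur, downsampling) and prior R:
\hat{y} = \arg\min_y \sum_i \| A_i y - x_i \|^2 + \lambda R(y)
% Deep reparametrization (schematic): a learned encoder e measures the
% error in feature space, and y = D(z) is decoded from a latent z, so the
% objective is minimized over z, with the prior absorbed into D:
\hat{z} = \arg\min_z \sum_i \| e(A_i D(z)) - e(x_i) \|^2
```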

Stochastic Scene-Aware Motion Prediction

Comment: ICCV2021

Link: http://arxiv.org/abs/2108.08284

Abstract

A long-standing goal in computer vision is to capture, model, and realistically synthesize human behavior. Specifically, by learning from data, our goal is to enable virtual humans to navigate within cluttered indoor scenes and naturally interact with objects. Such embodied behavior has applications in virtual reality, computer games, and robotics, while synthesized behavior can be used as a source of training data. This is challenging because real human motion is diverse and adapts to the scene. For example, a person can sit or lie on a sofa in many places and with varying styles. It is necessary to model this diversity when synthesizing virtual humans that realistically perform human-scene interactions. We present a novel data-driven, stochastic motion synthesis method that models different styles of performing a given action with a target object. Our method, called SAMP, for Scene-Aware Motion Prediction, generalizes to target objects of various geometries while enabling the character to navigate in cluttered scenes. To train our method, we collected MoCap data covering various sitting, lying down, walking, and running styles. We demonstrate our method on complex indoor scenes and achieve superior performance compared to existing solutions. Our code and data are available for research at https://samp.is.tue.mpg.de.

End-to-End Urban Driving by Imitating a Reinforcement Learning Coach

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.08265

Abstract

End-to-end approaches to autonomous driving commonly rely on expert demonstrations. Although humans are good drivers, they are not good coaches for end-to-end algorithms that demand dense on-policy supervision. On the contrary, automated experts that leverage privileged information can efficiently generate large scale on-policy and off-policy demonstrations. However, existing automated experts for urban driving make heavy use of hand-crafted rules and perform suboptimally even on driving simulators, where ground-truth information is available. To address these issues, we train a reinforcement learning expert that maps bird's-eye view images to continuous low-level actions. While setting a new performance upper-bound on CARLA, our expert is also a better coach that provides informative supervision signals for imitation learning agents to learn from. Supervised by our reinforcement learning coach, a baseline end-to-end agent with monocular camera input achieves expert-level performance. Our end-to-end agent achieves a 78% success rate while generalizing to a new town and new weather on the NoCrash-dense benchmark and state-of-the-art performance on the more challenging CARLA LeaderBoard.

Towards Robust Human Trajectory Prediction in Raw Videos

Comment: 8 pages, 6 figures. Accepted by the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021)

Link: http://arxiv.org/abs/2108.08259

Abstract

Human trajectory prediction has received increased attention lately due to its importance in applications such as autonomous vehicles and indoor robots. However, most existing methods make predictions based on human-labeled trajectories and ignore the errors and noises in detection and tracking. In this paper, we study the problem of human trajectory forecasting in raw videos, and show that the prediction accuracy can be severely affected by various types of tracking errors. Accordingly, we propose a simple yet effective strategy to correct the tracking failures by enforcing prediction consistency over time. The proposed "re-tracking" algorithm can be applied to any existing tracking and prediction pipelines. Experiments on public benchmark datasets demonstrate that the proposed method can improve both tracking and prediction performance in challenging real-world scenarios. The code and data are available at https://git.io/retracking-prediction.
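
A minimal sketch of the consistency idea: when the associated detection strays too far from the trajectory predictor's forecast, treat it as a tracking failure and fall back to the prediction. The threshold rule and function names are assumptions for illustration, not the paper's implementation.

```python
def retrack(track_history, detection, predict_next, tau=2.0):
    """track_history: list of (x, y); detection: (x, y) or None."""
    pred = predict_next(track_history)   # trajectory predictor's forecast
    if detection is None:
        return pred                      # fill in a missed detection
    dist = ((detection[0] - pred[0]) ** 2
            + (detection[1] - pred[1]) ** 2) ** 0.5
    if dist > tau:
        return pred                      # likely ID switch or drift: trust the prediction
    return detection                     # consistent: keep the detection
```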

LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-based 3D Detector

Comment: ICCV'21

Link: http://arxiv.org/abs/2108.08258

Abstract

Stereo-based 3D detection aims at detecting 3D object bounding boxes from stereo images using intermediate depth maps or implicit 3D geometry representations, which provides a low-cost solution for 3D perception. However, its performance is still inferior compared with LiDAR-based detection algorithms. To detect and localize accurate 3D bounding boxes, LiDAR-based models can encode accurate object boundaries and surface normal directions from LiDAR point clouds. However, the detection results of stereo-based detectors are easily affected by the erroneous depth features due to the limitation of stereo matching. To solve the problem, we propose LIGA-Stereo (LiDAR Geometry Aware Stereo Detector) to learn stereo-based 3D detectors under the guidance of high-level geometry-aware representations of LiDAR-based detection models. In addition, we found existing voxel-based stereo detectors failed to learn semantic features effectively from indirect 3D supervisions. We attach an auxiliary 2D detection head to provide direct 2D semantic supervisions. Experiment results show that the above two strategies improved the geometric and semantic representation capabilities. Compared with the state-of-the-art stereo detector, our method has improved the 3D detection performance of cars, pedestrians, cyclists by 10.44%, 5.69%, 5.97% mAP respectively on the official KITTI benchmark. The gap between stereo-based and LiDAR-based 3D detectors is further narrowed.

LOKI: Long Term and Key Intentions for Trajectory Prediction

Comment: ICCV 2021 (The dataset is available at https://usa.honda-ri.com/loki)

Link: http://arxiv.org/abs/2108.08236

Abstract

Recent advances in trajectory prediction have shown that explicit reasoning about agents' intent is important to accurately forecast their motion. However, the current research activities are not directly applicable to intelligent and safety-critical systems. This is mainly because very few public datasets are available, and they only consider pedestrian-specific intents for a short temporal horizon from a restricted egocentric view. To this end, we propose LOKI (LOng term and Key Intentions), a novel large-scale dataset that is designed to tackle joint trajectory and intention prediction for heterogeneous traffic agents (pedestrians and vehicles) in an autonomous driving setting. The LOKI dataset is created to discover several factors that may affect intention, including i) agent's own will, ii) social interactions, iii) environmental constraints, and iv) contextual information. We also propose a model that jointly performs trajectory and intention prediction, showing that recurrently reasoning about intention can assist with trajectory prediction. We show our method outperforms state-of-the-art trajectory prediction methods by up to 27% and also provide a baseline for frame-wise intention estimation.

MBRS: Enhancing Robustness of DNN-based Watermarking by Mini-Batch of Real and Simulated JPEG Compression

Comment: 9 pages, 6 figures, received by ACM MM'21

Link: http://arxiv.org/abs/2108.08211

Abstract

Based on the powerful feature extraction ability of deep learning architectures, deep-learning-based watermarking algorithms have been widely studied recently. The basic framework of such algorithms is an auto-encoder-like end-to-end architecture with an encoder, a noise layer and a decoder. The key to guaranteeing robustness is adversarial training with a differentiable noise layer. However, we found that none of the existing frameworks can well ensure robustness against JPEG compression, which is non-differentiable but is an essential and important image processing operation. To address such limitations, we propose a novel end-to-end training architecture, which utilizes a Mini-Batch of Real and Simulated JPEG compression (MBRS) to enhance the JPEG robustness. Precisely, for different mini-batches, we randomly choose one of real JPEG, simulated JPEG and a noise-free layer as the noise layer. Besides, we suggest utilizing Squeeze-and-Excitation blocks, which can learn better features in the embedding and extracting stages, and propose a "message processor" to expand the message in a more appropriate way. Meanwhile, to improve the robustness against crop attacks, we propose an additive diffusion block in the network. The extensive experimental results have demonstrated the superior performance of the proposed scheme compared with the state-of-the-art algorithms. Under JPEG compression with quality factor Q=50, our models achieve a bit error rate of less than 0.01% for extracted messages, with PSNR larger than 36 for the encoded images, which shows the well-enhanced robustness against JPEG attack. Besides, under many other distortions such as Gaussian filter, crop, cropout and dropout, the proposed framework also obtains strong robustness. The code, implemented in PyTorch, is available at https://github.com/jzyustc/MBRS.
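
The mini-batch mechanism is easy to picture in PyTorch. Below is a minimal sketch of the idea as stated in the abstract; the layer classes are placeholders, not the authors' code.

```python
import random
import torch.nn as nn

class Identity(nn.Module):
    def forward(self, x):
        return x

class MiniBatchNoise(nn.Module):
    """Per mini-batch, pick one of {real JPEG, simulated JPEG, noise-free}."""
    def __init__(self, real_jpeg: nn.Module, sim_jpeg: nn.Module):
        super().__init__()
        self.choices = nn.ModuleList([real_jpeg, sim_jpeg, Identity()])

    def forward(self, encoded):
        layer = random.choice(list(self.choices))
        # Real JPEG is non-differentiable, so on those batches no gradient
        # flows back through the noise layer to the encoder; the simulated
        # JPEG and noise-free batches keep the encoder trainable.
        return layer(encoded)
```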

Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.08202

Abstract

Internet video delivery has undergone a tremendous explosion of growth over the past few years. However, the quality of video delivery systems greatly depends on the Internet bandwidth. Deep Neural Networks (DNNs) have recently been utilized to improve the quality of video delivery. These methods divide a video into chunks, and stream LR video chunks and corresponding content-aware models to the client. The client runs the inference of the models to super-resolve the LR chunks. Consequently, a large number of models are streamed in order to deliver a video. In this paper, we first carefully study the relation between the models of different chunks, then we tactfully design a joint training framework along with the Content-aware Feature Modulation (CaFM) layer to compress these models for neural video delivery. With our method, each video chunk requires less than 1% of the original parameters to be streamed, achieving even better SR performance. We conduct extensive experiments across various SR backbones, video time lengths, and scaling factors to demonstrate the advantages of our method. Besides, our method can also be viewed as a new approach to video coding. Our primary experiments achieve better video quality compared with the commercial H.264 and H.265 standards under the same storage cost, showing the great potential of the proposed method. Code is available at: https://github.com/Neural-video-delivery/CaFM-Pytorch-ICCV2021
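
A plausible reading of "content-aware feature modulation" is a tiny per-chunk set of channel-wise parameters on top of one shared backbone, so only those parameters are streamed per chunk. The sketch below is an assumption in that spirit, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CaFM(nn.Module):
    """Per-chunk channel-wise scale and bias applied to backbone features."""
    def __init__(self, channels: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, feat):
        return feat * self.scale + self.bias

# Usage idea: one shared SR backbone, one tiny CaFM per chunk; the server
# streams only each chunk's CaFM state_dict (a fraction of the parameters).
cafm_per_chunk = {chunk_id: CaFM(64) for chunk_id in range(10)}
```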

Masked Face Recognition Challenge: The InsightFace Track Report

Comment: The WebFace260M Track of ICCV-21 MFR Challenge is still open in https://github.com/deepinsight/insightface/tree/master/challenges/iccv21-mfr

Link: http://arxiv.org/abs/2108.08191

Abstract

During the COVID-19 coronavirus epidemic, almost everyone wears a facial mask, which poses a huge challenge to deep face recognition. In this workshop, we organize the Masked Face Recognition (MFR) challenge and focus on benchmarking deep face recognition methods under the existence of facial masks. In the MFR challenge, there are two main tracks: the InsightFace track and the WebFace260M track. For the InsightFace track, we manually collect a large-scale masked face test set with 7K identities. In addition, we also collect a children test set including 14K identities and a multi-racial test set containing 242K identities. By using these three test sets, we build up an online model testing system, which can give a comprehensive evaluation of face recognition models. To avoid data privacy problems, no test image is released to the public. As the challenge is still ongoing, we will keep updating the top-ranked solutions as well as this report on arXiv.

ME-PCN: Point Completion Conditioned on Mask Emptiness

Comment: to appear in ICCV 2021

Link: http://arxiv.org/abs/2108.08187

Abstract

Point completion refers to completing the missing geometries of an object from incomplete observations. Mainstream methods predict the missing shapes by decoding a global feature learned from the input point cloud, which often leads to deficient results in preserving topology consistency and surface details. In this work, we present ME-PCN, a point completion network that leverages 'emptiness' in 3D shape space. Given a single depth scan, previous methods often encode the occupied partial shapes while ignoring the empty regions (e.g., holes) in depth maps. In contrast, we argue that these 'emptiness' clues indicate shape boundaries that can be used to improve topology representation and detail granularity on surfaces. Specifically, our ME-PCN encodes both the occupied point cloud and the neighboring 'empty points'. It estimates coarse-grained but complete and reasonable surface points in the first stage, followed by a refinement stage to produce fine-grained surface details. Comprehensive experiments verify that our ME-PCN presents better qualitative and quantitative performance against the state-of-the-art. Besides, we further prove that our 'emptiness' design is lightweight and easy to embed in existing methods, which shows consistent effectiveness in improving the CD and EMD scores.

Effect of Parameter Optimization on Classical and Learning-based Image Matching Methods

Comment: 8 pages, 2 figures, 3 tables, ICCV 2021 TradiCV Workshop

Link: http://arxiv.org/abs/2108.08179

Abstract

Deep learning-based image matching methods have improved significantly during recent years. Although these methods are reported to outperform the classical techniques, the performance of the classical methods is not examined in detail. In this study, we compare classical and learning-based methods by employing mutual nearest neighbor search with ratio test and optimizing the ratio test threshold to achieve the best performance on two different performance metrics. After a fair comparison, the experimental results on the HPatches dataset reveal that the performance gap between classical and learning-based methods is not that significant. Throughout the experiments, we demonstrate that SuperGlue is the state-of-the-art technique for the image matching problem on the HPatches dataset. However, if a single parameter, namely the ratio test threshold, is carefully optimized, the well-known traditional method SIFT performs quite close to SuperGlue and even outperforms it in terms of mean matching accuracy (MMA) under 1 and 2 pixel thresholds. Moreover, a recent approach, DFM, which only uses pre-trained VGG features as descriptors and ratio test, is shown to outperform most of the well-trained learning-based methods. Therefore, we conclude that the parameters of any classical method should be analyzed carefully before comparing it against a learning-based technique.
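
For reference, the tuned matcher itself is compact. Here is a minimal sketch of mutual nearest-neighbor matching with Lowe's ratio test; `ratio_thresh` is the single parameter the study optimizes, and 0.8 is just a common default.

```python
import numpy as np

def match(desc1, desc2, ratio_thresh=0.8):
    # pairwise L2 distances between the two descriptor sets
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)
    nn12 = d.argmin(axis=1)   # best match in image 2 for each descriptor in image 1
    nn21 = d.argmin(axis=0)   # best match in image 1 for each descriptor in image 2
    matches = []
    for i, j in enumerate(nn12):
        if nn21[j] != i:      # keep mutual nearest neighbors only
            continue
        first, second = np.sort(d[i])[:2]
        if first < ratio_thresh * second:   # Lowe's ratio test
            matches.append((i, j))
    return matches
```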

Deployment of Deep Neural Networks for Object Detection on Edge AI Devices with Runtime Optimization

Comment: To present in ICCV 2021 (ERCVAD Workshop)

Link: http://arxiv.org/abs/2108.08166

Abstract

Deep neural networks have proven increasingly important for automotive scene understanding with new algorithms offering constant improvements of the detection performance. However, there is little emphasis on experiences and needs for deployment in embedded environments. We therefore perform a case study of the deployment of two representative object detection networks on an edge AI platform. In particular, we consider RetinaNet for image-based 2D object detection and PointPillars for LiDAR-based 3D object detection. We describe the modifications necessary to convert the algorithms from a PyTorch training environment to the deployment environment taking into account the available tools. We evaluate the runtime of the deployed DNN using two different libraries, TensorRT and TorchScript. In our experiments, we observe slight advantages of TensorRT for convolutional layers and TorchScript for fully connected layers. We also study the trade-off between runtime and performance, when selecting an optimized setup for deployment, and observe that quantization significantly reduces the runtime while having only little impact on the detection performance.
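
The TorchScript side of such a conversion is a few lines. The sketch below uses a ResNet-18 as a stand-in for RetinaNet/PointPillars, with an illustrative input shape; it shows the general export pattern, not the paper's deployment code.

```python
import torch
import torchvision

model = torchvision.models.resnet18().eval()   # stand-in for the detector
example = torch.randn(1, 3, 224, 224)          # illustrative input shape
scripted = torch.jit.trace(model, example)     # record ops for this input
scripted.save("model_ts.pt")                   # deployable, Python-free artifact
```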

Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.08165

Abstract

Both generalized and incremental few-shot learning have to deal with three major challenges: learning novel classes from only a few samples per class, preventing catastrophic forgetting of base classes, and classifier calibration across novel and base classes. In this work we propose a three-stage framework that allows us to explicitly and effectively address these challenges. While the first phase learns base classes with many samples, the second phase learns a calibrated classifier for novel classes from few samples while also preventing catastrophic forgetting. In the final phase, calibration is achieved across all classes. We evaluate the proposed framework on four challenging benchmark datasets for image and video few-shot classification and obtain state-of-the-art results for both generalized and incremental few-shot learning.

Specificity-preserving RGB-D Saliency Detection

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.08162

Abstract

RGB-D saliency detection has attracted increasing attention, due to its effectiveness and the fact that depth cues can now be conveniently captured. Existing works often focus on learning a shared representation through various fusion strategies, with few methods explicitly considering how to preserve modality-specific characteristics. In this paper, taking a new perspective, we propose a specificity-preserving network (SP-Net) for RGB-D saliency detection, which benefits saliency detection performance by exploring both the shared information and modality-specific properties (e.g., specificity). Specifically, two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps. A cross-enhanced integration module (CIM) is proposed to fuse cross-modal features in the shared learning network, which are then propagated to the next layer for integrating cross-level information. Besides, we propose a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder, which can provide rich complementary multi-modal information to boost the saliency detection performance. Further, a skip connection is used to combine hierarchical features between the encoder and decoder layers. Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods. Code is available at: https://github.com/taozh2017/SPNet.

Single-DARTS: Towards Stable Architecture Search

Comment: Accepted by ICCV 2021 NeurArch Workshop

Link: http://arxiv.org/abs/2108.08128

Abstract

Differentiable architecture search (DARTS) marks a milestone in Neural Architecture Search (NAS), boasting simplicity and small search costs. However, DARTS still suffers from frequent performance collapse, which happens when some operations, such as skip connections, zeroes and poolings, dominate the architecture. In this paper, we are the first to point out that the phenomenon is attributed to bi-level optimization. We propose Single-DARTS, which merely uses single-level optimization, updating network weights and architecture parameters simultaneously with the same data batch. Even though single-level optimization has been previously attempted, no literature provides a systematic explanation of this essential point. Replacing the bi-level optimization, Single-DARTS obviously alleviates performance collapse as well as enhances the stability of architecture search. Experimental results show that Single-DARTS achieves state-of-the-art performance on mainstream search spaces. For instance, on NAS-Bench-201, the searched architectures are nearly optimal ones. We also validate that the single-level optimization framework is much more stable than the bi-level one. We hope that this simple yet effective method will give some insights into differential architecture search. The code is available at https://github.com/PencilAndBike/Single-DARTS.git.
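
The single-level update the abstract describes can be sketched as one backward pass that updates both parameter groups on the same mini-batch; the names and optimizers below are placeholders, not the authors' code.

```python
def single_darts_step(model, batch, loss_fn, opt_w, opt_alpha):
    # opt_w holds the network weights, opt_alpha the architecture parameters
    x, y = batch
    loss = loss_fn(model(x), y)
    opt_w.zero_grad()
    opt_alpha.zero_grad()
    loss.backward()    # one backward pass on one batch...
    opt_w.step()       # ...updates weights and
    opt_alpha.step()   # architecture parameters together (single-level),
                       # instead of DARTS' alternating bi-level updates
                       # on separate train/validation splits.
```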

Target Adaptive Context Aggregation for Video Scene Graph Generation

Comment: ICCV 2021 camera-ready version

Link: http://arxiv.org/abs/2108.08121

Abstract

This paper deals with the challenging task of video scene graph generation (VidSGG), which could serve as a structured video representation for high-level understanding tasks. We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking. Specifically, we design an efficient method for frame-level VidSGG, termed Target Adaptive Context Aggregation Network (TRACE), with a focus on capturing spatio-temporal context information for relation recognition. Our TRACE framework streamlines the VidSGG pipeline with a modular design, and presents two unique blocks: Hierarchical Relation Tree (HRTree) construction and Target-adaptive Context Aggregation. More specifically, our HRTree first provides an adaptive structure for organizing possible relation candidates efficiently, and guides the context aggregation module to effectively capture spatio-temporal structure information. Then, we obtain a contextualized feature representation for each relation candidate and build a classification head to recognize its relation category. Finally, we provide a simple temporal association strategy to track TRACE-detected results to yield the video-level VidSGG. We perform experiments on two VidSGG benchmarks: ImageNet-VidVRD and Action Genome, and the results demonstrate that our TRACE achieves state-of-the-art performance. The code and models are made available at https://github.com/MCG-NJU/TRACE.

Learning RAW-to-sRGB Mappings with Inaccurately Aligned Supervision

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.08119

Abstract

Learning RAW-to-sRGB mapping has drawn increasing attention in recent years, wherein an input raw image is trained to imitate the target sRGB image captured by another camera. However, the severe color inconsistency makes it very challenging to generate well-aligned training pairs of input raw and target sRGB images, and learning with inaccurately aligned supervision is prone to causing pixel shift and producing blurry results. In this paper, we circumvent this issue by presenting a joint learning model for image alignment and RAW-to-sRGB mapping. To diminish the effect of color inconsistency in image alignment, we introduce a global color mapping (GCM) module to generate an initial sRGB image given the input raw image, which keeps the spatial location of the pixels unchanged, and the target sRGB image is utilized to guide GCM in converting the color towards it. Then a pre-trained optical flow estimation network (e.g., PWC-Net) is deployed to warp the target sRGB image to align with the GCM output. To alleviate the effect of inaccurately aligned supervision, the warped target sRGB image is leveraged to learn the RAW-to-sRGB mapping. When training is done, the GCM module and optical flow network can be detached, thereby bringing no extra computation cost for inference. Experiments show that our method performs favorably against state-of-the-arts on the ZRR and SR-RAW datasets. With our joint learning model, a light-weight backbone can achieve better quantitative and qualitative performance on the ZRR dataset. Code is available at https://github.com/cszhilu1998/RAW-to-sRGB.

Few-Shot Batch Incremental Road Object Detection via Detector Fusion

Comment: accepted at the 2nd Autonomous Vehicle Vision Workshop, ICCV 2021

Link: http://arxiv.org/abs/2108.08048

Abstract

Incremental few-shot learning has emerged as a new and challenging area in deep learning, whose objective is to train deep learning models using very few samples of new class data, and none of the old class data. In this work we tackle the problem of batch incremental few-shot road object detection using data from the India Driving Dataset (IDD). Our approach, DualFusion, combines object detectors in a manner that allows us to learn to detect rare objects with very limited data, all without severely degrading the performance of the detector on the abundant classes. In the IDD OpenSet incremental few-shot detection task, we achieve a mAP50 score of 40.0 on the base classes and an overall mAP50 score of 38.8, both of which are the highest to date. In the COCO batch incremental few-shot detection task, we achieve a novel AP score of 9.9, surpassing the state-of-the-art novel class performance on the same by over 6.6 times.

Adaptive Graph Convolution for Point Cloud Analysis

Comment: Camera-ready, to be published in ICCV 2021

Link: http://arxiv.org/abs/2108.08035

Abstract

Convolution on 3D point clouds that generalizes from 2D grid-like domains is widely researched yet far from perfect. The standard convolution characterises feature correspondences indistinguishably among 3D points, presenting an intrinsic limitation of poor distinctive feature learning. In this paper, we propose Adaptive Graph Convolution (AdaptConv), which generates adaptive kernels for points according to their dynamically learned features. Compared with using a fixed/isotropic kernel, AdaptConv improves the flexibility of point cloud convolutions, effectively and precisely capturing the diverse relations between points from different semantic parts. Unlike popular attentional weight schemes, the proposed AdaptConv implements the adaptiveness inside the convolution operation instead of simply assigning different weights to the neighboring points. Extensive qualitative and quantitative evaluations show that our method outperforms state-of-the-art point cloud classification and segmentation approaches on several benchmark datasets. Our code is available at https://github.com/hrzhou2/AdaptConv-master.
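
The contrast with attention can be made concrete: instead of a scalar weight per neighbor, a small network emits a full kernel per (point, neighbor) pair. The sketch below is one way to realize that idea; the shapes and kernel generator are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        # generates an (out_ch x in_ch) kernel per edge from the
        # concatenated center/neighbor features
        self.kernel_gen = nn.Linear(2 * in_ch, out_ch * in_ch)

    def forward(self, x_center, x_neighbors):
        # x_center: (N, C); x_neighbors: (N, K, C)
        n, k, c = x_neighbors.shape
        pair = torch.cat([x_center.unsqueeze(1).expand(-1, k, -1),
                          x_neighbors], dim=-1)
        kernels = self.kernel_gen(pair).view(n, k, self.out_ch, self.in_ch)
        # apply each edge-specific kernel to its neighbor feature,
        # then max-aggregate over the neighborhood
        out = torch.einsum('nkoc,nkc->nko', kernels, x_neighbors)
        return out.max(dim=1).values   # (N, out_ch)
```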

Variational Attention: Propagating Domain-Specific Knowledge for Multi-Domain Learning in Crowd Counting

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.08023

Abstract

In crowd counting, due to the problem of laborious labelling, it is perceived as intractable to collect a new large-scale dataset with plentiful images of large diversity in density, scene, etc. Thus, for learning a general model, training with data from multiple different datasets might be a remedy and of great value. In this paper, we resort to multi-domain joint learning and propose a simple but effective Domain-specific Knowledge Propagating Network (DKPNet) for unbiasedly learning the knowledge from multiple diverse data domains at the same time. This is mainly achieved by proposing the novel Variational Attention (VA) technique for explicitly modeling the attention distributions for different domains. As an extension to VA, Intrinsic Variational Attention (InVA) is proposed to handle the problems of overlapped domains and sub-domains. Extensive experiments have been conducted to validate the superiority of our DKPNet on several popular datasets, including ShanghaiTech A/B, UCF-QNRF and NWPU.

Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.08020

Abstract

Co-speech gesture generation is to synthesize a gesture sequence that not only looks real but also matches with the input speech audio. Our method generates the movements of a complete upper body, including arms, hands, and the head. Although recent data-driven methods achieve great success, challenges still exist, such as limited variety, poor fidelity, and lack of objective metrics. Motivated by the fact that the speech cannot fully determine the gesture, we design a method that learns a set of gesture template vectors to model the latent conditions, which relieves the ambiguity. For our method, the template vector determines the general appearance of a generated gesture sequence, while the speech audio drives subtle movements of the body, both indispensable for synthesizing a realistic gesture sequence. Due to the intractability of an objective metric for gesture-speech synchronization, we adopt the lip-sync error as a proxy metric to tune and evaluate the synchronization ability of our model. Extensive experiments show the superiority of our method in both objective and subjective evaluations on fidelity and synchronization.

RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving

Comment: To appear in ICCV 2021.

Code: https://github.com/ruocwang

Link: http://arxiv.org/abs/2108.08019

Abstract

Predictor-based algorithms have achieved remarkable performance in Neural Architecture Search (NAS) tasks. However, these methods suffer from high computation costs, as training the performance predictor usually requires training and evaluating hundreds of architectures from scratch. Previous works along this line mainly focus on reducing the number of architectures required to fit the predictor. In this work, we tackle this challenge from a different perspective: improve search efficiency by cutting down the computation budget of architecture training. We propose NOn-uniform Successive Halving (NOSH), a hierarchical scheduling algorithm that terminates the training of underperforming architectures early to avoid wasting budget. To effectively leverage the non-uniform supervision signals produced by NOSH, we formulate predictor-based architecture search as learning to rank with pairwise comparisons. The resulting method, RANK-NOSH, reduces the search budget by ~5x while achieving competitive or even better performance than previous state-of-the-art predictor-based methods on various spaces and datasets.
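
The "learning to rank with pairwise comparisons" part can be sketched with a margin loss over score differences; the hinge form below is a common choice assumed here, not necessarily the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def pairwise_rank_loss(scores, perf, margin=0.1):
    # scores: predictor outputs for a batch of architectures, shape (B,)
    # perf: observed (possibly early-terminated) performance, shape (B,)
    si, sj = scores[:, None], scores[None, :]
    better = (perf[:, None] > perf[None, :]).float()  # 1 where arch i beats arch j
    # penalize ordered pairs whose score difference is below the margin
    loss = better * F.relu(margin - (si - sj))
    return loss.sum() / better.sum().clamp(min=1)
```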

Deep Hybrid Self-Prior for Full 3D Mesh Generation

Comment: Accepted by ICCV2021

Link: http://arxiv.org/abs/2108.08017

Abstract

We present a deep learning pipeline that leverages network self-prior to recover a full 3D model consisting of both a triangular mesh and a texture map from the colored 3D point cloud. Different from previous methods either exploiting 2D self-prior for image editing or 3D self-prior for pure surface reconstruction, we propose to exploit a novel hybrid 2D-3D self-prior in deep neural networks to significantly improve the geometry quality and produce a high-resolution texture map, which is typically missing from the output of commodity-level 3D scanners. In particular, we first generate an initial mesh using a 3D convolutional neural network with 3D self-prior, and then encode both 3D information and color information in the 2D UV atlas, which is further refined by 2D convolutional neural networks with the self-prior. In this way, both 2D and 3D self-priors are utilized for the mesh and texture recovery. Experiments show that, without the need of any additional training data, our method recovers the 3D textured mesh model of high quality from sparse input, and outperforms the state-of-the-art methods in terms of both the geometry and texture quality.

Multi-Anchor Active Domain Adaptation for Semantic Segmentation

Comment: ICCV 2021 Oral

Link: http://arxiv.org/abs/2108.08012

Abstract

Unsupervised domain adaptation has proven to be an effective approach for alleviating the intensive workload of manual annotation by aligning the synthetic source-domain data and the real-world target-domain samples. Unfortunately, mapping the target-domain distribution to the source-domain unconditionally may distort the essential structural information of the target-domain data. To this end, we first propose to introduce a novel multi-anchor based active learning strategy to assist domain adaptation for the semantic segmentation task. By innovatively adopting multiple anchors instead of a single centroid, the source domain can be better characterized as a multimodal distribution, and thus more representative and complementary samples are selected from the target domain. With little workload to manually annotate these active samples, the distortion of the target-domain distribution can be effectively alleviated, resulting in a large performance gain. The multi-anchor strategy is additionally employed to model the target distribution. By regularizing the latent representation of the target samples to be compact around multiple anchors through a novel soft alignment loss, more precise segmentation can be achieved. Extensive experiments are conducted on public datasets to demonstrate that the proposed approach outperforms state-of-the-art methods significantly, along with a thorough ablation study to verify the effectiveness of each component.

Structured Outdoor Architecture Reconstruction by Exploration and Classification

Comment: 2021 International Conference on Computer Vision (ICCV 2021)

Link: http://arxiv.org/abs/2108.07990

Abstract

This paper presents an explore-and-classify framework for structured architectural reconstruction from an aerial image. Starting from a potentially imperfect building reconstruction by an existing algorithm, our approach 1) explores the space of building models by modifying the reconstruction via heuristic actions; 2) learns to classify the correctness of building models while generating classification labels based on the ground-truth; and 3) repeats. At test time, we iterate exploration and classification, seeking the result with the best classification score. We evaluate the approach using initial reconstructions from two baselines and two state-of-the-art reconstruction algorithms. Qualitative and quantitative evaluations demonstrate that our approach consistently improves the reconstruction quality from every initial reconstruction.

A New Journey from SDRTV to HDRTV

Comment: Accepted to ICCV

Link: http://arxiv.org/abs/2108.07978

Abstract

Nowadays, modern displays are capable of rendering video content with high dynamic range (HDR) and wide color gamut (WCG). However, most available resources are still in standard dynamic range (SDR). Therefore, there is an urgent demand to transform existing SDR-TV contents into their HDR-TV versions. In this paper, we conduct an analysis of the SDRTV-to-HDRTV task by modeling the formation of SDRTV/HDRTV content. Based on the analysis, we propose a three-step solution pipeline including adaptive global color mapping, local enhancement and highlight generation. Moreover, the above analysis inspires us to present a lightweight network that utilizes global statistics as guidance to conduct image-adaptive color mapping. In addition, we construct a dataset using HDR videos in the HDR10 standard, named HDRTV1K, and select five metrics to evaluate the results of SDRTV-to-HDRTV algorithms. Furthermore, our final results achieve state-of-the-art performance in quantitative comparisons and visual quality. The code and dataset are available at https://github.com/chxy95/HDRTVNet.

Thermal Image Processing via Physics-Inspired Deep Networks

Comment: Accepted to 2nd ICCV workshop on Learning for Computational Imaging (LCI)

Link: http://arxiv.org/abs/2108.07973

Abstract

We introduce DeepIR, a new thermal image processing framework that combines physically accurate sensor modeling with deep network-based image representation. Our key enabling observations are that the images captured by thermal sensors can be factored into slowly changing, scene-independent sensor non-uniformities (that can be accurately modeled using physics) and a scene-specific radiance flux (that is well-represented using a deep network-based regularizer). DeepIR requires neither training data nor periodic ground-truth calibration with a known black body target, making it well suited for practical computer vision tasks. We demonstrate the power of going DeepIR by developing new denoising and super-resolution algorithms that exploit multiple images of the scene captured with camera jitter. Simulated and real data experiments demonstrate that DeepIR can perform high-quality non-uniformity correction with as few as three images, achieving a 10dB PSNR improvement over competing approaches.
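
The factorization the abstract describes can be written schematically; the symbols below are chosen here for illustration and are not taken from the paper.

```latex
% Observation of the t-th jittered capture: per-pixel gain g and offset o
% (slowly varying, scene-independent non-uniformities, modeled by physics)
% applied to the scene radiance flux \Phi seen under camera motion W_t:
I_t = g \odot \Phi(W_t) + o + \eta_t
% DeepIR fits g and o as physical parameters while representing \Phi with
% a deep-network regularizer, jointly across a few jittered frames.
```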

SynFace: Face Recognition with Synthetic Data

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.07960

Abstract

With the recent success of deep neural networks, remarkable progress has been achieved on face recognition. However, collecting large-scale real-world training data for face recognition has turned out to be challenging, especially due to the label noise and privacy issues. Meanwhile, existing face recognition datasets are usually collected from web images, lacking detailed annotations on attributes (e.g., pose and expression), so the influences of different attributes on face recognition have been poorly investigated. In this paper, we address the above-mentioned issues in face recognition using synthetic face images, i.e., SynFace. Specifically, we first explore the performance gap between recent state-of-the-art face recognition models trained with synthetic and real face images. We then analyze the underlying causes behind the performance gap, e.g., the poor intra-class variations and the domain gap between synthetic and real face images. Inspired by this, we devise SynFace with identity mixup (IM) and domain mixup (DM) to mitigate the above performance gap, demonstrating the great potential of synthetic data for face recognition. Furthermore, with the controllable face synthesis model, we can easily manage different factors of synthetic face generation, including pose, expression, illumination, the number of identities, and samples per identity. Therefore, we also perform a systematic empirical analysis on synthetic face images to provide some insights on how to effectively utilize synthetic data for face recognition.
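
Mixup itself is a one-liner; the sketch below shows the generic interpolation behind IM/DM. Applying it at the image level with soft labels is an assumption for illustration; the paper's variants may operate elsewhere (e.g., in the identity space of the face synthesis model).

```python
import torch

def mixup(x_a, x_b, y_a, y_b, alpha=0.2):
    # random convex combination of two samples and their labels
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x = lam * x_a + (1 - lam) * x_b   # blended input
    y = lam * y_a + (1 - lam) * y_b   # soft label with the same weights
    return x, y
```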

Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

Comment: Accepted to ICCV 2021

Link: http://arxiv.org/abs/2108.07954

Abstract

Advanced self-supervised visual representation learning methods rely on the instance discrimination (ID) pretext task. We point out that the ID task has an implicit semantic consistency (SC) assumption, which may not hold in unconstrained datasets. In this paper, we propose a novel contrastive mask prediction (CMP) task for visual representation learning and design a mask contrast (MaskCo) framework to implement the idea. MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions. To solve the domain gap between masked and unmasked features, we design a dedicated mask prediction head in MaskCo. This module is shown to be the key to the success of the CMP. We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2. Results show that MaskCo achieves comparable performance with MoCo V2 using the ImageNet training dataset, but demonstrates a stronger performance across a range of downstream tasks when COCO or Conceptual Captions are used for training. MaskCo provides a promising alternative to the ID-based methods for self-supervised learning in the wild.

FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning

Comment: 10 pages, 9 figures. Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.07938

Abstract

In this paper, we propose a talking face generation method that takes an audio signal as input and a short target video clip as reference, and synthesizes a photo-realistic video of the target face with natural lip motions, head poses, and eye blinks that are in sync with the input audio signal. We note that the synthetic face attributes include not only explicit ones such as lip motions that have high correlations with speech, but also implicit ones such as head poses and eye blinks that have only weak correlation with the input audio. To model such complicated relationships among different face attributes with the input audio, we propose a FACe Implicit Attribute Learning Generative Adversarial Network (FACIAL-GAN), which integrates the phonetics-aware, context-aware, and identity-aware information to synthesize the 3D face animation with realistic motions of lips, head poses, and eye blinks. Then, our Rendering-to-Video network takes the rendered face images and the attention map of eye blinks as input to generate the photo-realistic output video frames. Experimental results and user studies show our method can generate realistic talking face videos with not only synchronized lip motions, but also natural head movements and eye blinks, with better quality than the results of state-of-the-art methods.

Towards Interpreting Zoonotic Potential of Betacoronavirus Sequences With Attention

Comment: 11 pages, 8 figures, 1 table, accepted at ICLR 2021 workshop Machine Learning for Preventing and Combating Pandemics

Link: http://arxiv.org/abs/2108.08077

Abstract

Current methods for viral discovery target evolutionarily conserved proteins that accurately identify virus families but remain unable to distinguish the zoonotic potential of newly discovered viruses. Here, we apply an attention-enhanced long short-term memory (LSTM) deep neural net classifier to a highly conserved viral protein target to predict zoonotic potential across betacoronaviruses. The classifier performs with 94% accuracy. Analysis and visualization of attention at the sequence- and structure-level features indicate a possible association between important protein-protein interactions governing viral replication in zoonotic betacoronaviruses and zoonotic transmission.

XAI Methods for Neural Time Series Classification: A Brief Review

Comment: 8 pages, 0 figures, Accepted as a poster presentation

Link: http://arxiv.org/abs/2108.08009

Abstract

Deep learning models have recently demonstrated remarkable results in a variety of tasks, which is why they are being increasingly applied in high-stake domains, such as industry, medicine, and finance. Considering that automatic predictions in these domains might have a substantial impact on the well-being of a person, as well as considerable financial and legal consequences for an individual or a company, all actions and decisions that result from applying these models have to be accountable. Given that a substantial amount of the data collected in high-stake domains is in the form of time series, in this paper we examine the current state of eXplainable AI (XAI) methods with a focus on approaches for opening up deep learning black boxes for the task of time series classification. Finally, our contribution also aims at deriving promising directions for future work, to advance XAI for deep learning on time series data.
