About #今日arXiv精选

This is a column run by「AI 学术前沿」: the editors select high-quality papers from arXiv each day and deliver them to readers.

Group-based Distinctive Image Captioning with Memory Attention

Comment: Accepted at ACM MM 2021 (oral)

Link: http://arxiv.org/abs/2108.09151

Abstract

Describing images using natural language is widely known as image captioning, which has made consistent progress due to the development of computer vision and natural language generation techniques. Though conventional captioning models achieve high accuracy based on popular metrics, i.e., BLEU, CIDEr, and SPICE, the ability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employ contrastive learning or re-weight the ground-truth captions, focusing on a single input image. However, the relationships between objects in a similar image group (e.g., items or properties within the same album or fine-grained events) are neglected. In this paper, we improve the distinctiveness of image captions using a Group-based Distinctive Captioning Model (GdisCap), which compares each image with other images in one similar group and highlights the uniqueness of each image. In particular, we propose a group-based memory attention (GMA) module, which stores object features that are unique among the image group (i.e., with low similarity to objects in other images). These unique object features are highlighted when generating captions, resulting in more distinctive captions. Furthermore, the distinctive words in the ground-truth captions are selected to supervise the language decoder and GMA. Finally, we propose a new evaluation metric, distinctive word rate (DisWordRate), to measure the distinctiveness of captions. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves state-of-the-art performance on both accuracy and distinctiveness. Results of a user study agree with the quantitative evaluation and demonstrate the rationality of the new metric DisWordRate.
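The abstract introduces DisWordRate but does not spell out its formula. The sketch below is one plausible, simplified reading, in which a generated word counts as distinctive if it appears in the target image's reference captions but not in those of the other images in the similarity group; the function name and tokenization are illustrative, not the paper's definition.

```python
# Illustrative sketch of a distinctive-word-rate style metric: a word is
# "distinctive" if it appears in the target image's reference captions but
# not in the references of the other images in the group. The paper's exact
# DisWordRate definition may differ.

def dis_word_rate(caption, target_refs, group_refs):
    """caption: list of generated tokens.
    target_refs: reference captions (token lists) for the target image.
    group_refs: reference captions for the *other* images in the group."""
    target_vocab = {w for ref in target_refs for w in ref}
    group_vocab = {w for ref in group_refs for w in ref}
    distinctive = target_vocab - group_vocab
    if not caption:
        return 0.0
    return sum(w in distinctive for w in caption) / len(caption)

print(dis_word_rate(
    ["a", "red", "kayak", "on", "a", "lake"],
    [["a", "red", "kayak", "floating", "on", "a", "lake"]],
    [["a", "boat", "on", "a", "lake"], ["a", "canoe", "near", "a", "dock"]],
))  # "red" and "kayak" count as distinctive -> 2/6
```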

Airbert: In-domain Pretraining for Vision-and-Language Navigation

Comment: To be published at ICCV 2021. Webpage is at https://airbert-vln.github.io/ linking to our dataset, code and models

Link: http://arxiv.org/abs/2108.09105

Abstract

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using IC pairs, we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model, which can be adapted to discriminative and generative settings, and show that it outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.
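The abstract only states that the shuffling loss improves temporal-order learning inside path-instruction pairs. A minimal sketch, assuming a ranking-style formulation: build negatives by permuting a pair's steps and train a compatibility scorer to prefer the original order. The scorer and hyperparameters below are stand-ins, not Airbert's actual components.

```python
# Sketch of a "shuffling loss": create negatives by permuting the steps of a
# path-instruction pair and train a scorer to rank the ordered version
# highest. For brevity we do not re-check that a shuffle differs from the
# original order.
import random
import torch
import torch.nn.functional as F

def shuffling_loss(score_fn, steps, num_negatives=3):
    """steps: list of per-step feature tensors for one PI pair.
    score_fn: maps an ordered sequence of steps to a scalar score."""
    candidates = [steps]  # index 0 = correct temporal order
    for _ in range(num_negatives):
        shuffled = steps[:]
        random.shuffle(shuffled)
        candidates.append(shuffled)
    scores = torch.stack([score_fn(c) for c in candidates])
    # Cross-entropy over candidates, with the ordered sequence as the target.
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))

# Toy usage: score a sequence by a learned weighted sum of its concatenation.
feat_dim, n_steps = 8, 4
w = torch.randn(feat_dim * n_steps, requires_grad=True)
steps = [torch.randn(feat_dim) for _ in range(n_steps)]
loss = shuffling_loss(lambda seq: torch.cat(seq) @ w, steps)
loss.backward()
```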

GEDIT: Geographic-Enhanced and Dependency-Guided Tagging for Joint POI and Accessibility Extraction at Baidu Maps

Comment: Accepted by CIKM'21

Link: http://arxiv.org/abs/2108.09104

Abstract

Providing timely accessibility reminders of a point-of-interest (POI) plays a vital role in improving user satisfaction in finding places and making visiting decisions. However, it is difficult to keep the POI database in sync with the real-world counterparts due to the dynamic nature of business changes. To alleviate this problem, we formulate and present a practical solution that jointly extracts POI mentions and identifies their coupled accessibility labels from unstructured text. We approach this task as a sequence tagging problem, where the goal is to produce (POI, accessibility) pairs from unstructured text. This task is challenging because of two main issues: (1) POI names are often newly-coined words so as to successfully register new entities or brands, and (2) there may exist multiple pairs in the text, which necessitates dealing with one-to-many or many-to-one mapping to make each POI coupled with its accessibility label. To this end, we propose a Geographic-Enhanced and Dependency-guIded sequence Tagging (GEDIT) model to concurrently address the two challenges. First, to alleviate challenge #1, we develop a geographic-enhanced pre-trained model to learn the text representations. Second, to mitigate challenge #2, we apply a relational graph convolutional network to learn the tree node representations from the parsed dependency tree. Finally, we construct a neural sequence tagging model by integrating and feeding the previously pre-learned representations into a CRF layer. Extensive experiments conducted on a real-world dataset demonstrate the superiority and effectiveness of GEDIT. In addition, it has already been deployed in production at Baidu Maps. Statistics show that the proposed solution can save significant human effort and labor costs in dealing with the same amount of documents, which confirms that it is a practical way to maintain POI accessibility.
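To make the final stage concrete, here is a minimal sketch of feeding concatenated pre-learned token representations into a CRF layer, as the abstract describes, using the `pytorch-crf` package. The random features stand in for GEDIT's geographic-enhanced text encoder and relational-GCN tree encoder.

```python
# Minimal sketch of the tagging head: concatenate text and dependency-tree
# representations per token, project to tag emissions, and train/decode with
# a CRF. Requires the `pytorch-crf` package.
import torch
import torch.nn as nn
from torchcrf import CRF

class TaggerHead(nn.Module):
    def __init__(self, text_dim, tree_dim, num_tags):
        super().__init__()
        self.emit = nn.Linear(text_dim + tree_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, text_repr, tree_repr, tags, mask):
        emissions = self.emit(torch.cat([text_repr, tree_repr], dim=-1))
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def decode(self, text_repr, tree_repr, mask):
        emissions = self.emit(torch.cat([text_repr, tree_repr], dim=-1))
        return self.crf.decode(emissions, mask=mask)

# Toy usage: a batch of 2 sentences of length 5 with random features.
head = TaggerHead(text_dim=16, tree_dim=8, num_tags=7)
text = torch.randn(2, 5, 16); tree = torch.randn(2, 5, 8)
tags = torch.randint(0, 7, (2, 5)); mask = torch.ones(2, 5, dtype=torch.bool)
print(head.loss(text, tree, tags, mask))
print(head.decode(text, tree, mask))
```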

SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles

Comment: Preprint of CIKM 2021 Resource Paper, 10 pages

Link: http://arxiv.org/abs/2108.09070

Abstract

Knowledge about software used in scientific investigations is important for several reasons, for instance, to enable an understanding of provenance and methods involved in data handling. However, software is usually not formally cited, but rather mentioned informally within the scholarly description of the investigation, raising the need for automatic information extraction and disambiguation. Given the lack of reliable ground truth data, we present SoMeSci (Software Mentions in Science), a gold standard knowledge graph of software mentions in scientific articles. It contains high quality annotations (IRR: $\kappa{=}.82$) of 3756 software mentions in 1367 PubMed Central articles. Besides the plain mention of the software, we also provide relation labels for additional information, such as the version, the developer, a URL, or citations. Moreover, we distinguish between different types of software, such as application, plugin or programming environment, as well as different types of mentions, such as usage or creation. To the best of our knowledge, SoMeSci is the most comprehensive corpus about software mentions in scientific articles, providing training samples for Named Entity Recognition, Relation Extraction, Entity Disambiguation, and Entity Linking. Finally, we sketch potential use cases and provide baseline results.

Twitter User Representation using Weakly Supervised Graph Embedding

Comment: accepted at 16th International AAAI Conference on Web and Social  Media (ICWSM-2022), direct accept from May 2021 submission, 12 pages

Link: http://arxiv.org/abs/2108.08988

Abstract

Social media platforms provide convenient means for users to participate in multiple online activities on various contents and create fast widespread interactions. However, this rapidly growing access has also increased the diversity of information, and characterizing user types to understand people's lifestyle decisions shared in social media is challenging. In this paper, we propose a weakly supervised graph embedding based framework for understanding user types. We evaluate the user embeddings learned using weak supervision over well-being related tweets from Twitter, focusing on 'Yoga' and 'Keto diet'. Experiments on real-world datasets demonstrate that the proposed framework outperforms the baselines for detecting user types. Finally, we illustrate data analysis on different types of users (e.g., practitioner vs. promotional) from our dataset. While we focus on lifestyle-related tweets (i.e., yoga, keto), our method for constructing user representations readily generalizes to other domains.

SMedBERT: A Knowledge-Enhanced Pre-trained Language Model with Structured Semantics for Medical Text Mining

Comment: ACL2021

Link: http://arxiv.org/abs/2108.08983

Abstract

Recently, the performance of Pre-trained Language Models (PLMs) has been significantly improved by injecting knowledge facts to enhance their abilities of language understanding. For medical domains, background knowledge sources are especially useful, because the massive medical terms and their complicated relations are difficult to understand from text alone. In this work, we introduce SMedBERT, a medical PLM trained on large-scale medical corpora, incorporating deep structured semantic knowledge from the neighbors of linked entities. In SMedBERT, the mention-neighbor hybrid attention is proposed to learn heterogeneous-entity information, which infuses the semantic representations of entity types into the homogeneous neighboring entity structure. Apart from knowledge integration as external features, we propose to employ the neighbors of linked entities in the knowledge graph as additional global contexts of text mentions, allowing them to communicate via shared neighbors, thus enriching their semantic representations. Experiments demonstrate that SMedBERT significantly outperforms strong baselines in various knowledge-intensive Chinese medical tasks. It also improves the performance of other tasks such as question answering, question matching, and natural language inference.
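One plausible reading of using linked-entity neighbors as extra global context is a simple attention read over neighbor embeddings, added back to the mention representation. This is only an illustration; SMedBERT's mention-neighbor hybrid attention is considerably more structured.

```python
# Rough sketch: a text mention attends over the KG-neighbor embeddings of
# its linked entity and adds the weighted context back (residual). This is
# an illustration of the idea, not SMedBERT's actual module.
import torch
import torch.nn.functional as F

def neighbor_context(mention, neighbor_embs):
    """mention: (d,) tensor; neighbor_embs: (k, d) tensor of KG neighbors."""
    attn = F.softmax(neighbor_embs @ mention / mention.shape[0] ** 0.5, dim=0)
    return mention + attn @ neighbor_embs  # residual add of weighted context

m = torch.randn(64)
ctx = neighbor_context(m, torch.randn(5, 64))
print(ctx.shape)  # torch.Size([64])
```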

Discriminative Region-based Multi-Label Zero-Shot Learning

Comment: Accepted to ICCV 2021. Source code is available at  https://github.com/akshitac8/BiAM

Link: http://arxiv.org/abs/2108.09301

Abstract

Multi-label zero-shot learning (ZSL) is a more realistic counterpart of standard single-label ZSL, since several objects can co-exist in a natural image. However, the occurrence of multiple objects complicates the reasoning and requires region-specific processing of visual features to preserve their contextual cues. We note that the best existing multi-label ZSL method takes a shared approach towards attending to region features, with a common set of attention maps for all the classes. Such shared maps lead to diffused attention, which does not discriminatively focus on relevant locations when the number of classes is large. Moreover, mapping spatially-pooled visual features to the class semantics leads to inter-class feature entanglement, thus hampering the classification. Here, we propose an alternate approach towards region-based discriminability-preserving multi-label zero-shot classification. Our approach maintains the spatial resolution to preserve region-level characteristics and utilizes a bi-level attention module (BiAM) to enrich the features by incorporating both region and scene context information. The enriched region-level features are then mapped to the class semantics, and only their class predictions are spatially pooled to obtain image-level predictions, thereby keeping the multi-class features disentangled. Our approach sets a new state of the art on two large-scale multi-label zero-shot benchmarks: NUS-WIDE and Open Images. On NUS-WIDE, our approach achieves an absolute gain of 6.9% mAP for ZSL, compared to the best published results.
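The key disentanglement idea, classifying each region first and pooling predictions rather than features, fits in a few lines. The max over regions below is one plausible pooling choice, and BiAM's feature enrichment is omitted.

```python
# Sketch of "classify regions first, pool predictions later": per-region
# features are scored against class semantic vectors, and only the logits
# are pooled over space (here with max; the paper's pooling may differ).
import torch

def image_logits(region_feats, class_semantics):
    """region_feats: (H*W, d) enriched region features.
    class_semantics: (num_classes, d) class embeddings (e.g., word vectors)."""
    region_logits = region_feats @ class_semantics.T   # (H*W, num_classes)
    return region_logits.max(dim=0).values             # pool predictions, not features

feats = torch.randn(49, 300)       # 7x7 regions, flattened
semantics = torch.randn(81, 300)   # e.g., 81 class embeddings
print(image_logits(feats, semantics).shape)  # torch.Size([81])
```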

MG-GAN: A Multi-Generator Model Preventing Out-of-Distribution Samples in Pedestrian Trajectory Prediction

Comment: Accepted at ICCV 2021; Code available:  https://github.com/selflein/MG-GAN

Link: http://arxiv.org/abs/2108.09274

Abstract

Pedestrian trajectory prediction is challenging due to its uncertain and multimodal nature. While generative adversarial networks can learn a distribution over future trajectories, they tend to predict out-of-distribution samples when the distribution of future trajectories is a mixture of multiple, possibly disconnected modes. To address this issue, we propose a multi-generator model for pedestrian trajectory prediction. Each generator specializes in learning a distribution over trajectories routing towards one of the primary modes in the scene, while a second network learns a categorical distribution over these generators, conditioned on the dynamics and scene input. This architecture allows us to effectively sample from specialized generators and to significantly reduce the out-of-distribution samples compared to single generator methods.
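A sketch of the two-stage sampling the abstract describes: a gating network produces a categorical distribution over specialized generators, a generator index is sampled, and that generator produces the trajectory. All modules below are toy stand-ins for MG-GAN's actual networks.

```python
# Two-stage sampling: gate -> categorical over generators -> sample one
# generator per input -> generate a trajectory from noise + condition.
import torch

class MultiGenerator(torch.nn.Module):
    def __init__(self, cond_dim, noise_dim, traj_dim, n_gens):
        super().__init__()
        self.noise_dim = noise_dim
        self.gens = torch.nn.ModuleList(
            [torch.nn.Linear(cond_dim + noise_dim, traj_dim) for _ in range(n_gens)])
        self.gate = torch.nn.Linear(cond_dim, n_gens)  # categorical over generators

    def sample(self, cond):                            # cond: (B, cond_dim)
        probs = torch.softmax(self.gate(cond), dim=-1)
        k = torch.distributions.Categorical(probs).sample()  # generator id per sample
        z = torch.randn(cond.shape[0], self.noise_dim)
        rows = [self.gens[int(ki)](torch.cat([c, zi])) for ki, c, zi in zip(k, cond, z)]
        return torch.stack(rows), k

model = MultiGenerator(cond_dim=32, noise_dim=16, traj_dim=24, n_gens=4)
traj, which = model.sample(torch.randn(8, 32))
print(traj.shape, which.tolist())  # torch.Size([8, 24]) + chosen generator ids
```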

Continual Learning for Image-Based Camera Localization

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.09112

Abstract

For several emerging technologies such as augmented reality, autonomous driving and robotics, visual localization is a critical component. Directly regressing camera pose/3D scene coordinates from the input image using deep neural networks has shown great potential. However, such methods assume a stationary data distribution with all scenes simultaneously available during training. In this paper, we approach the problem of visual localization in a continual learning setup, whereby the model is trained on scenes in an incremental manner. Our results show that, similar to the classification domain, non-stationary data induces catastrophic forgetting in deep networks for visual localization. To address this issue, a strong baseline based on storing and replaying images from a fixed buffer is proposed. Furthermore, we propose a new sampling method based on coverage score (Buff-CS) that adapts the existing sampling strategies in the buffering process to the problem of visual localization. Results demonstrate consistent improvements over standard buffering methods on two challenging datasets, 7Scenes and 12Scenes, as well as on 19Scenes, obtained by combining the former scenes.

Single Image Defocus Deblurring Using Kernel-Sharing Parallel Atrous Convolutions

Comment: Accepted to ICCV 2021

Link: http://arxiv.org/abs/2108.09108

Abstract

This paper proposes a novel deep learning approach for single image defocus deblurring based on inverse kernels. In a defocused image, the blur shapes are similar among pixels although the blur sizes can spatially vary. To utilize the property with inverse kernels, we exploit the observation that when only the size of a defocus blur changes while keeping the shape, the shape of the corresponding inverse kernel remains the same and only the scale changes. Based on the observation, we propose a kernel-sharing parallel atrous convolutional (KPAC) block specifically designed by incorporating the property of inverse kernels for single image defocus deblurring. To effectively simulate the invariant shapes of inverse kernels with different scales, KPAC shares the same convolutional weights among multiple atrous convolution layers. To efficiently simulate the varying scales of inverse kernels, KPAC consists of only a few atrous convolution layers with different dilations and learns per-pixel scale attentions to aggregate the outputs of the layers. KPAC also utilizes the shape attention to combine the outputs of multiple convolution filters in each atrous convolution layer, to deal with defocus blur with a slightly varying shape. We demonstrate that our approach achieves state-of-the-art performance with a much smaller number of parameters than previous methods.
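The weight-sharing mechanism is easy to sketch: a single convolution kernel is applied in parallel at several dilation rates (same inverse-kernel shape, different scale), and learned per-pixel scale attention mixes the parallel outputs. The shape attention and the full KPAC block layout are omitted here.

```python
# Sketch of a KPAC-style shared-weight atrous block: one kernel, several
# dilations, per-pixel softmax attention over the parallel outputs.
import torch
import torch.nn.functional as F

class SharedAtrousBlock(torch.nn.Module):
    def __init__(self, channels, dilations=(1, 2, 3, 4)):
        super().__init__()
        self.dilations = dilations
        # One shared 3x3 kernel reused at every dilation rate.
        self.weight = torch.nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.02)
        self.scale_attn = torch.nn.Conv2d(channels, len(dilations), 1)

    def forward(self, x):
        outs = [F.conv2d(x, self.weight, padding=d, dilation=d) for d in self.dilations]
        attn = torch.softmax(self.scale_attn(x), dim=1)   # (B, num_scales, H, W)
        return sum(attn[:, i:i + 1] * o for i, o in enumerate(outs))

block = SharedAtrousBlock(channels=8)
print(block(torch.randn(1, 8, 32, 32)).shape)  # torch.Size([1, 8, 32, 32])
```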

Towards Understanding the Generative Capability of Adversarially Robust Classifiers

Comment: Accepted by ICCV 2021, Oral

Link: http://arxiv.org/abs/2108.09093

Abstract

Recently, some works found an interesting phenomenon that adversarially robust classifiers can generate good images comparable to generative models. We investigate this phenomenon from an energy perspective and provide a novel explanation. We reformulate adversarial example generation, adversarial training, and image generation in terms of an energy function. We find that adversarial training contributes to obtaining an energy function that is flat and has low energy around the real data, which is the key for generative capability. Based on our new understanding, we further propose a better adversarial training method, Joint Energy Adversarial Training (JEAT), which can generate high-quality images and achieve new state-of-the-art robustness under a wide range of attacks. The Inception Score of the images (CIFAR-10) generated by JEAT is 8.80, much better than original robust classifiers (7.50). In particular, we achieve new state-of-the-art robustness on CIFAR-10 (from 57.20% to 62.04%) and CIFAR-100 (from 30.03% to 30.18%) without extra training data.
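For background, a common way to read a classifier as an energy model (popularized by JEM) defines E(x) = -logsumexp_y f(x)[y], so low energy corresponds to high unnormalized density and generation follows the energy downhill. This is one plausible backdrop for the energy view here, not necessarily JEAT's exact formulation.

```python
# Energy of an input under a classifier, JEM-style, plus a toy
# Langevin-like descent step that moves an image towards lower energy.
import torch

def energy(logits):
    """logits: (B, num_classes) classifier outputs f(x)."""
    return -torch.logsumexp(logits, dim=1)

def langevin_step(x, model, step=1.0, noise=0.01):
    x = x.detach().requires_grad_(True)
    e = energy(model(x)).sum()
    grad, = torch.autograd.grad(e, x)
    return x - step * grad + noise * torch.randn_like(x)
```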

AutoLay: Benchmarking amodal layout estimation for autonomous driving

Comment: published in 2020 IEEE/RSJ International Conference on Intelligent  Robots and Systems (IROS)

Link: http://arxiv.org/abs/2108.09047

Abstract

Given an image or a video captured from a monocular camera, amodal layout estimation is the task of predicting semantics and occupancy in bird's eye view. The term amodal implies we also reason about entities in the scene that are occluded or truncated in image space. While several recent efforts have tackled this problem, there is a lack of standardization in task specification, datasets, and evaluation protocols. We address these gaps with AutoLay, a dataset and benchmark for amodal layout estimation from monocular images. AutoLay encompasses driving imagery from two popular datasets: KITTI and Argoverse. In addition to fine-grained attributes such as lanes, sidewalks, and vehicles, we also provide semantically annotated 3D point clouds. We implement several baselines and bleeding edge approaches, and release our data and code.

Out-of-boundary View Synthesis Towards Full-Frame Video Stabilization

Comment: 10 pages, 6 figures, accepted by ICCV2021

Link: http://arxiv.org/abs/2108.09041

Abstract

Warping-based video stabilizers smooth the camera trajectory by constraining each pixel's displacement and warp stabilized frames from unstable ones accordingly. However, since the view outside the boundary is not available during warping, the resulting holes around the boundary of the stabilized frame must be discarded (i.e., cropping) to maintain visual consistency, which leads to a tradeoff between stability and cropping ratio. In this paper, we make a first attempt to address this issue by proposing a new Out-of-boundary View Synthesis (OVS) method. Exploiting the spatial coherence between adjacent frames and within each frame, OVS extrapolates the out-of-boundary view by aligning adjacent frames to each reference one. Technically, it first calculates the optical flow and propagates it to the outer boundary region according to the affinity, and then warps pixels accordingly. OVS can be integrated into existing warping-based stabilizers as a plug-and-play module to significantly improve the cropping ratio of the stabilized results. In addition, stability is improved because the jitter amplification effect caused by cropping and resizing is reduced. Experimental results on the NUS benchmark show that OVS can improve the performance of five representative state-of-the-art methods in terms of objective metrics and subjective visual quality. The code is publicly available at https://github.com/Annbless/OVS_Stabilization.

Video-based Person Re-identification with Spatial and Temporal Memory Networks

Comment: International Conference on Computer Vision (ICCV) 2021

Link: http://arxiv.org/abs/2108.09039

Abstract

Video-based person re-identification (reID) aims to retrieve person videos with the same identity as a query person across multiple cameras. Spatial and temporal distractors in person videos, such as background clutter and partial occlusions over frames, respectively, make this task much more challenging than image-based person reID. We observe that spatial distractors appear consistently in a particular location, and temporal distractors show several patterns, e.g., partial occlusions occur in the first few frames, where such patterns provide informative cues for predicting which frames to focus on (i.e., temporal attentions). Based on this, we introduce novel Spatial and Temporal Memory Networks (STMN). The spatial memory stores features for spatial distractors that frequently emerge across video frames, while the temporal memory saves attentions which are optimized for typical temporal patterns in person videos. We leverage the spatial and temporal memories to refine frame-level person representations and to aggregate the refined frame-level features into a sequence-level person representation, respectively, effectively handling spatial and temporal distractors in person videos. We also introduce a memory spread loss that prevents our model from attending to only particular items in the memories. Experimental results on standard benchmarks, including MARS, DukeMTMC-VideoReID, and LS-VID, demonstrate the effectiveness of our method.
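At the core of both memories is an attention-based read: frame features query a small set of learned memory items and retrieve a weighted combination. A minimal sketch, with STMN's refinement and aggregation steps left out:

```python
# Attention-based memory read: queries attend over learned key/value items
# and retrieve a weighted combination of the values.
import torch
import torch.nn.functional as F

class MemoryRead(torch.nn.Module):
    def __init__(self, num_items, dim):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(num_items, dim))
        self.values = torch.nn.Parameter(torch.randn(num_items, dim))

    def forward(self, query):                  # query: (B, dim)
        attn = F.softmax(query @ self.keys.T / query.shape[-1] ** 0.5, dim=-1)
        return attn @ self.values              # (B, dim) retrieved content

mem = MemoryRead(num_items=10, dim=128)
print(mem(torch.randn(4, 128)).shape)  # torch.Size([4, 128])
```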

Is it Time to Replace CNNs with Transformers for Medical Images?

Comment: Originally published at the ICCV 2021 Workshop on Computer Vision for  Automated Medical Diagnosis (CVAMD)

Link: http://arxiv.org/abs/2108.09038

Abstract

Convolutional Neural Networks (CNNs) have reigned for a decade as the de facto approach to automated medical image diagnosis. Recently, vision transformers (ViTs) have appeared as a competitive alternative to CNNs, yielding similar levels of performance while possessing several interesting properties that could prove beneficial for medical imaging tasks. In this work, we explore whether it is time to move to transformer-based models or if we should keep working with CNNs - can we trivially switch to transformers? If so, what are the advantages and drawbacks of switching to ViTs for medical image diagnosis? We consider these questions in a series of experiments on three mainstream medical image datasets. Our findings show that, while CNNs perform better when trained from scratch, off-the-shelf vision transformers using default hyperparameters are on par with CNNs when pretrained on ImageNet, and outperform their CNN counterparts when pretrained using self-supervision.

AdvDrop: Adversarial Attack to DNNs by Dropping Information

Comment: Accepted to ICCV 2021

Link: http://arxiv.org/abs/2108.09034

Abstract

Humans can easily recognize visual objects with lost information: even losing most details, with only the contour preserved, e.g., cartoons. However, in terms of the visual perception of Deep Neural Networks (DNNs), the ability to recognize abstract objects (visual objects with lost information) is still a challenge. In this work, we investigate this issue from an adversarial viewpoint: will the performance of DNNs decrease even for images only losing a little information? Towards this end, we propose a novel adversarial attack, named \textit{AdvDrop}, which crafts adversarial examples by dropping existing information of images. Previously, most adversarial attacks add extra disturbing information on clean images explicitly. Opposite to previous works, our proposed work explores the adversarial robustness of DNN models from a novel perspective, by dropping imperceptible details to craft adversarial examples. We demonstrate the effectiveness of \textit{AdvDrop} with extensive experiments, and show that this new type of adversarial example is more difficult to defend against with current defense systems.

Pixel Contrastive-Consistent Semi-Supervised Semantic Segmentation

Comment: To appear in ICCV 2021

Link: http://arxiv.org/abs/2108.09025

Abstract

We present a novel semi-supervised semantic segmentation method which jointly achieves two desiderata of segmentation model regularities: the label-space consistency property between image augmentations and the feature-space contrastive property among different pixels. We leverage the pixel-level L2 loss and the pixel contrastive loss for the two purposes, respectively. To address the computational efficiency issue and the false negative noise issue involved in the pixel contrastive loss, we further introduce and investigate several negative sampling techniques. Extensive experiments demonstrate the state-of-the-art performance of our method (PC2Seg) with the DeepLab-v3+ architecture, in several challenging semi-supervised settings derived from the VOC, Cityscapes, and COCO datasets.
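A minimal InfoNCE-style pixel contrastive loss makes the second property concrete: each anchor pixel embedding is pulled toward its positive (the same pixel under another augmentation) and pushed away from sampled negatives. The negative sampling and false-negative handling that the paper investigates are not shown.

```python
# InfoNCE-style pixel contrastive loss over per-pixel embeddings.
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(anchor, positive, negatives, tau=0.1):
    """anchor, positive: (N, d); negatives: (N, K, d). All L2-normalized."""
    pos = (anchor * positive).sum(-1, keepdim=True)           # (N, 1)
    neg = torch.einsum('nd,nkd->nk', anchor, negatives)       # (N, K)
    logits = torch.cat([pos, neg], dim=1) / tau
    targets = torch.zeros(anchor.shape[0], dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, targets)

a = F.normalize(torch.randn(32, 64), dim=-1)
p = F.normalize(a + 0.1 * torch.randn_like(a), dim=-1)
n = F.normalize(torch.randn(32, 8, 64), dim=-1)
print(pixel_contrastive_loss(a, p, n))
```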

Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data

Comment: Accepted to ICCV 2021

Link: http://arxiv.org/abs/2108.09020

Abstract

Continual learning is the problem of learning and retaining knowledge through time over multiple tasks and environments. Research has primarily focused on the incremental classification setting, where new tasks/classes are added at discrete time intervals. Such an "offline" setting does not evaluate the ability of agents to learn effectively and efficiently, since an agent can perform multiple learning epochs without any time limitation when a task is added. We argue that "online" continual learning, where data is a single continuous stream without task boundaries, enables evaluating both information retention and online learning efficacy. In online continual learning, each incoming small batch of data is first used for testing and then added to the training set, making the problem truly online. Trained models are later evaluated on historical data to assess information retention. We introduce a new benchmark for online continual visual learning that exhibits large scale and natural distribution shifts. Through a large-scale analysis, we identify critical and previously unobserved phenomena of gradient-based optimization in continual learning, and propose effective strategies for improving gradient-based online continual learning with real data. The source code and dataset are available at: https://github.com/IntelLabs/continuallearning.
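The test-then-train protocol from the abstract fits in a short loop; `model`, `update`, and the data stream below are placeholders.

```python
# Online continual learning loop: every incoming batch is evaluated first,
# then trained on, and kept so retention can be measured later on history.

def online_continual_loop(model, update, stream, history):
    online_correct, online_total = 0, 0
    for x, y in stream:                      # single pass, no task boundaries
        pred = model(x)                      # 1) test on the incoming batch
        online_correct += int((pred.argmax(-1) == y).sum())
        online_total += len(y)
        update(model, x, y)                  # 2) then train on it
        history.append((x, y))               # kept for later retention tests
    return online_correct / max(online_total, 1)
```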

DeFRCN: Decoupled Faster R-CNN for Few-Shot Object Detection

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.09017

Abstract

Few-shot object detection, which aims at detecting novel objects rapidly from extremely few annotated examples of previously unseen classes, has attracted significant research interest in the community. Most existing approaches employ the Faster R-CNN as the basic detection framework, yet, due to the lack of tailored considerations for the data-scarce scenario, their performance is often not satisfactory. In this paper, we look closely into the conventional Faster R-CNN and analyze its contradictions from two orthogonal perspectives, namely multi-stage (RPN vs. RCNN) and multi-task (classification vs. localization). To resolve these issues, we propose a simple yet effective architecture, named Decoupled Faster R-CNN (DeFRCN). To be concrete, we extend Faster R-CNN by introducing a Gradient Decoupled Layer for multi-stage decoupling and a Prototypical Calibration Block for multi-task decoupling. The former is a novel deep layer that redefines the feature-forward and gradient-backward operations to decouple its subsequent layer from its preceding layer, and the latter is an offline prototype-based classification model that takes the proposals from the detector as input and boosts the original classification scores with additional pairwise scores for calibration. Extensive experiments on multiple benchmarks show that our framework is remarkably superior to other existing approaches and establishes a new state of the art in the few-shot literature.
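The core of a gradient-decoupling layer can be sketched as a custom autograd function: the forward pass leaves features unchanged while the backward pass rescales the gradient flowing to preceding layers. DeFRCN's actual Gradient Decoupled Layer also involves an affine transform; only the decoupling mechanism is shown.

```python
# Forward: identity. Backward: scale the gradient by a factor (0 = fully
# stop the gradient, 1 = no decoupling).
import torch

class GradientDecouple(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        return ctx.scale * grad_out, None  # shrink gradient to earlier layers

x = torch.randn(3, requires_grad=True)
y = GradientDecouple.apply(x, 0.1).sum()
y.backward()
print(x.grad)  # gradients scaled by 0.1
```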

Dual Projection Generative Adversarial Networks for Conditional Image Generation

Comment: Accepted at ICCV-21

Link: http://arxiv.org/abs/2108.09016

Abstract

Conditional Generative Adversarial Networks (cGANs) extend the standard unconditional GAN framework to learning joint data-label distributions from samples, and have been established as powerful generative models capable of generating high-fidelity imagery. A challenge of training such a model lies in properly infusing class information into its generator and discriminator. For the discriminator, class conditioning can be achieved by either (1) directly incorporating labels as input or (2) involving labels in an auxiliary classification loss. In this paper, we show that the former directly aligns the class-conditioned fake-and-real data distributions $P(\text{image}|\text{class})$ ({\em data matching}), while the latter aligns data-conditioned class distributions $P(\text{class}|\text{image})$ ({\em label matching}). Although class separability does not directly translate to sample quality and becomes a burden if classification itself is intrinsically difficult, the discriminator cannot provide useful guidance for the generator if features of distinct classes are mapped to the same point and thus become inseparable. Motivated by this intuition, we propose a Dual Projection GAN (P2GAN) model that learns to balance between {\em data matching} and {\em label matching}. We then propose an improved cGAN model with Auxiliary Classification that directly aligns the fake and real conditionals $P(\text{class}|\text{image})$ by minimizing their $f$-divergence. Experiments on a synthetic Mixture of Gaussian (MoG) dataset and a variety of real-world datasets including CIFAR100, ImageNet, and VGGFace2 demonstrate the efficacy of our proposed models.
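The two conditioning routes are easy to contrast in code: a projection term (label embedding dotted with discriminator features, i.e., data matching) versus an auxiliary classification head (label matching). How P2GAN balances the two is not shown, and the layers are stand-ins.

```python
# A discriminator exposing both conditioning routes: a projection score and
# auxiliary class logits.
import torch
import torch.nn as nn

class DualConditionD(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.phi = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())  # feature extractor stand-in
        self.psi = nn.Linear(feat_dim, 1)                 # unconditional realness score
        self.embed = nn.Embedding(num_classes, feat_dim)  # projection conditioning
        self.cls = nn.Linear(feat_dim, num_classes)       # auxiliary classifier

    def forward(self, x, y):
        h = self.phi(x)
        proj_score = self.psi(h).squeeze(-1) + (self.embed(y) * h).sum(-1)
        class_logits = self.cls(h)                        # for the auxiliary loss
        return proj_score, class_logits

D = DualConditionD(feat_dim=128, num_classes=10)
score, logits = D(torch.randn(4, 256), torch.randint(0, 10, (4,)))
print(score.shape, logits.shape)  # torch.Size([4]) torch.Size([4, 10])
```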

GAN Inversion for Out-of-Range Images with Geometric Transformations

Comment: Accepted to ICCV 2021. For supplementary material, see  https://kkang831.github.io/publication/ICCV_2021_BDInvert/

Link: http://arxiv.org/abs/2108.08998

Abstract

For successful semantic editing of real images, it is critical for a GAN inversion method to find an in-domain latent code that aligns with the domain of a pre-trained GAN model. Unfortunately, such in-domain latent codes can be found only for in-range images that align with the training images of a GAN model. In this paper, we propose BDInvert, a novel GAN inversion approach to semantic editing of out-of-range images that are geometrically unaligned with the training images of a GAN model. To find a latent code that is semantically editable, BDInvert inverts an input out-of-range image into an alternative latent space instead of the original latent space. We also propose a regularized inversion method to find a solution that supports semantic editing in the alternative space. Our experiments show that BDInvert effectively supports semantic editing of out-of-range images with geometric transformations.

Few Shot Activity Recognition Using Variational Inference

Comment: Accepted in IJCAI 2021 - 3RD INTERNATIONAL WORKSHOP ON DEEP LEARNING  FOR HUMAN ACTIVITY RECOGNITION. arXiv admin note: text overlap with  arXiv:1611.09630, arXiv:1909.07945 by other authors

Link: http://arxiv.org/abs/2108.08990

Abstract

There has been remarkable progress in the last few years in learning models that can recognise novel classes with only a few labeled examples. Few-shot learning (FSL) for action recognition is a challenging task of recognising novel action categories which are represented by few instances in the training data. We propose a novel variational inference based architectural framework (HF-AR) for few shot activity recognition. Our framework leverages volume-preserving Householder Flow to learn a flexible posterior distribution of the novel classes. This results in better performance as compared to state-of-the-art few shot approaches for human activity recognition. Our architecture consists of a base model and an adapter model. The base model is trained on seen classes and computes an embedding that represents the spatial and temporal insights extracted from the input video, e.g., a combination of Resnet-152 and an LSTM based encoder-decoder model. The adapter model applies a series of Householder transformations to compute a flexible posterior distribution that lends higher accuracy in the few shot approach. Extensive experiments on three well-known datasets: UCF101, HMDB51 and Something-Something-V2, demonstrate similar or better performance on 1-shot and 5-shot classification as compared to state-of-the-art few shot approaches that use only RGB frame sequences as input. To the best of our knowledge, we are the first to explore variational inference along with Householder transformations to capture the full rank covariance matrix of the posterior distribution for few shot learning in activity recognition.
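Householder flow itself is compact: each step multiplies by the reflection H = I - 2 v v^T / ||v||^2, which is orthogonal and hence volume-preserving (the flow's log-determinant term vanishes). A minimal sketch:

```python
# Apply a series of Householder reflections to base samples; each reflection
# is orthogonal, so vector norms (and volume) are preserved.
import torch

def householder_flow(z, vs):
    """z: (B, d) base samples; vs: iterable of (d,) reflection vectors."""
    for v in vs:
        v = v / v.norm()
        z = z - 2.0 * torch.outer(z @ v, v)  # apply H = I - 2vv^T to each row
    return z

z0 = torch.randn(5, 4)
zK = householder_flow(z0, [torch.randn(4) for _ in range(3)])
print(torch.allclose(z0.norm(dim=1), zK.norm(dim=1)))  # True: norms preserved
```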

Parsing Birdsong with Deep Audio Embeddings

Comment: IJCAI 2021 Artificial Intelligence for Social Good (AI4SG) Workshop

Link: http://arxiv.org/abs/2108.09203

Abstract

Monitoring of bird populations has played a vital role in conservation efforts and in understanding biodiversity loss. The automation of this process has been facilitated by both sensing technologies, such as passive acoustic monitoring, and accompanying analytical tools, such as deep learning. However, machine learning models frequently have difficulty generalizing to examples not encountered in the training data. In our work, we present a semi-supervised approach to identify characteristic calls and environmental noise. We utilize several methods to learn a latent representation of audio samples, including a convolutional autoencoder and two pre-trained networks, and group the resulting embeddings for a domain expert to identify cluster labels. We show that our approach can improve classification precision and provide insight into the latent structure of environmental acoustic datasets.
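The embed-then-cluster workflow reduces to a few lines with scikit-learn; the random vectors below stand in for embeddings from the convolutional autoencoder or pre-trained networks, and the printed clusters are what a domain expert would inspect and label.

```python
# Embed audio clips, cluster the embeddings, hand clusters to an expert.
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.random.randn(200, 128)  # stand-in for learned audio embeddings
clusters = KMeans(n_clusters=8, n_init=10).fit_predict(embeddings)
for c in range(8):
    print(f"cluster {c}: {np.sum(clusters == c)} clips")  # expert labels each cluster
```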

Reinforcement Learning to Optimize Lifetime Value in Cold-Start Recommendation

Comment: Accepted by CIKM 2021

Link: http://arxiv.org/abs/2108.09141

Abstract

Recommender systems play a crucial role in modern E-commerce platforms. Due to the lack of historical interactions between users and items, cold-start recommendation is a challenging problem. In order to alleviate the cold-start issue, most existing methods introduce content and contextual information as the auxiliary information. Nevertheless, these methods assume the recommended items behave steadily over time, while in a typical E-commerce scenario, items generally have very different performances throughout their life period. In such a situation, it would be beneficial to consider the long-term return from the item perspective, which is usually ignored in conventional methods. Reinforcement learning (RL) naturally fits such a long-term optimization problem, in which the recommender could identify high potential items, proactively allocate more user impressions to boost their growth, and therefore improve the multi-period cumulative gains. Inspired by this idea, we model the process as a Partially Observable and Controllable Markov Decision Process (POC-MDP), and propose an actor-critic RL framework (RL-LTV) to incorporate the item lifetime values (LTV) into the recommendation. In RL-LTV, the critic studies historical trajectories of items and predicts the future LTV of fresh items, while the actor suggests a score-based policy which maximizes the future LTV expectation. Scores suggested by the actor are then combined with classical ranking scores in a dual-rank framework, so that the recommendation is balanced with the LTV consideration. Our method outperforms the strong live baseline with relative improvements of 8.67% and 18.03% on the IPV and GMV of cold-start items, on one of the largest E-commerce platforms.
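The dual-rank combination can be sketched as a simple blend of the classical ranking score and the actor's LTV score; the weight `alpha` is a hypothetical knob, and the paper's exact fusion may differ.

```python
# Blend a classical relevance score with the RL actor's LTV-oriented score;
# `alpha` is a hypothetical balancing weight.

def dual_rank_score(rank_score, ltv_score, alpha=0.2):
    return (1 - alpha) * rank_score + alpha * ltv_score

items = [("itemA", 0.92, 0.10), ("itemB", 0.85, 0.90), ("itemC", 0.80, 0.40)]
ranked = sorted(items, key=lambda t: dual_rank_score(t[1], t[2]), reverse=True)
print([name for name, *_ in ranked])  # ['itemB', 'itemA', 'itemC']
```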

Lessons from the Clustering Analysis of a Search Space: A Centroid-based Approach to Initializing NAS

Comment: Accepted to the Workshop on 'Data Science Meets Optimisation' at  IJCAI 2021

Link: http://arxiv.org/abs/2108.09126

Abstract

A lot of effort in neural architecture search (NAS) research has been dedicated to algorithmic development, aiming at designing more efficient and less costly methods. Nonetheless, the investigation of the initialization of these techniques remains scarce, and currently most NAS methodologies rely on stochastic initialization procedures, because acquiring information prior to search is costly. However, the recent availability of NAS benchmarks has enabled prototyping with low computational resources. In this study, we propose to accelerate a NAS algorithm using a data-driven initialization technique, leveraging the availability of NAS benchmarks. In particular, we propose a two-step methodology. First, a calibrated clustering analysis of the search space is performed. Second, the centroids are extracted and used to initialize a NAS algorithm. We tested our proposal using Aging Evolution, an evolutionary algorithm, on NAS-Bench-101. The results show that, compared to a random initialization, a faster convergence and a better performance of the final solution are achieved.
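A minimal sketch of the two-step initialization, assuming architectures can be encoded as fixed-length vectors (as in NAS-Bench-101): cluster the encodings, then seed the search population with the architectures nearest each centroid. The random encodings are stand-ins.

```python
# Step 1: cluster architecture encodings. Step 2: seed the initial
# population with the architectures closest to the cluster centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
encodings = rng.random((1000, 56))  # e.g., flattened adjacency + op encodings

km = KMeans(n_clusters=10, n_init=10).fit(encodings)
init_population = []
for c in range(10):
    dists = np.linalg.norm(encodings - km.cluster_centers_[c], axis=1)
    init_population.append(int(dists.argmin()))  # architecture nearest the centroid
print(init_population)  # indices used to seed, e.g., Aging Evolution
```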

DL-Traff: Survey and Benchmark of Deep Learning Models for Urban Traffic Prediction

Comment: This paper has been accepted by CIKM 2021 Resource Track

Link: http://arxiv.org/abs/2108.09091

Abstract

Nowadays, with the rapid development of IoT (Internet of Things) and CPS (Cyber-Physical Systems) technologies, big spatiotemporal data are being generated from mobile phones, car navigation systems, and traffic sensors. By leveraging state-of-the-art deep learning technologies on such data, urban traffic prediction has drawn a lot of attention in the AI and Intelligent Transportation System communities. The problem can be uniformly modeled with a 3D tensor (T, N, C), where T denotes the total time steps, N denotes the size of the spatial domain (i.e., mesh-grids or graph-nodes), and C denotes the channels of information. According to the specific modeling strategy, the state-of-the-art deep learning models can be divided into three categories: grid-based, graph-based, and multivariate time-series models. In this study, we first synthetically review the deep traffic models as well as the widely used datasets, then build a standard benchmark to comprehensively evaluate their performances with the same settings and metrics. Our study, named DL-Traff, is implemented with the two most popular deep learning frameworks, i.e., TensorFlow and PyTorch, and is publicly available as two GitHub repositories: https://github.com/deepkashiwa20/DL-Traff-Grid and https://github.com/deepkashiwa20/DL-Traff-Graph. With DL-Traff, we hope to deliver a useful resource to researchers who are interested in spatiotemporal data analysis.

FedSkel: Efficient Federated Learning on Heterogeneous Systems with Skeleton Gradients Update

Comment: CIKM 2021

Link: http://arxiv.org/abs/2108.09081

Abstract

Federated learning aims to protect users' privacy while performing data analysis across different participants. However, it is challenging to guarantee training efficiency on heterogeneous systems due to the various computational capabilities and communication bottlenecks. In this work, we propose FedSkel to enable computation-efficient and communication-efficient federated learning on edge devices by only updating the model's essential parts, named skeleton networks. FedSkel is evaluated on real edge devices with imbalanced datasets. Experimental results show that it could achieve up to 5.52$\times$ speedups for CONV layers' back-propagation, 1.82$\times$ speedups for the whole training process, and reduce 64.8% communication cost, with negligible accuracy loss.

ASAT: Adaptively Scaled Adversarial Training in Time Series

Comment: Accepted to appear in the Workshop on Machine Learning in Finance (KDD-MLF) 2021

Link: http://arxiv.org/abs/2108.08976

Abstract

Adversarial training is a method for enhancing neural networks to improve their robustness against adversarial examples. Besides the security concerns of potential adversarial examples, adversarial training can also improve the performance of neural networks, train robust neural networks, and provide interpretability for neural networks. In this work, we take the first step to introduce adversarial training in time series analysis, taking the finance field as an example. Rethinking existing research on adversarial training, we propose adaptively scaled adversarial training (ASAT) in time series analysis, which treats data at different time slots with time-dependent importance weights. Experimental results show that the proposed ASAT can improve both the accuracy and the adversarial robustness of neural networks. Besides enhancing neural networks, we also propose a dimension-wise adversarial sensitivity indicator to probe the sensitivities and importance of input dimensions. With the proposed indicator, we can explain the decision bases of black box neural networks.
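A sketch of what adaptively scaled perturbations could look like: an FGSM-style attack whose budget at each time slot is multiplied by a time-dependent importance weight. The linear recency weighting below is a hypothetical choice, not ASAT's actual scaling scheme.

```python
# FGSM-style perturbation whose per-time-slot budget is scaled by a
# (hypothetical) recency weight; `model` and `loss_fn` are placeholders.
import torch

def time_scaled_fgsm(model, x, y, loss_fn, eps=0.01):
    """x: (B, T, d) time series batch; later slots get larger budgets here."""
    x = x.detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    T = x.shape[1]
    w = torch.linspace(0.2, 1.0, T).view(1, T, 1)  # time-dependent scaling
    return (x + eps * w * grad.sign()).detach()
```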

Explainable Reinforcement Learning for Broad-XAI: A Conceptual Framework and Survey

Comment: 22 pages, 7 figures

Link: http://arxiv.org/abs/2108.09003

Abstract

Broad Explainable Artificial Intelligence (Broad-XAI) moves away from interpreting individual decisions based on a single datum and aims to integrate explanations from multiple machine learning algorithms into a coherent explanation of an agent's behaviour that is aligned to the communication needs of the explainee. Reinforcement Learning (RL) methods, we propose, provide a potential backbone for the cognitive model required for the development of Broad-XAI. RL represents a suite of approaches that have had increasing success in solving a range of sequential decision-making problems. However, these algorithms all operate as black-box problem solvers, where they obfuscate their decision-making policy through a complex array of values and functions. EXplainable RL (XRL) is a relatively recent field of research that aims to develop techniques to extract concepts from the agent's perception of the environment; its intrinsic/extrinsic motivations and beliefs; and its Q-values, goals and objectives. This paper aims to introduce a conceptual framework, called the Causal XRL Framework (CXF), that unifies the current XRL research and uses RL as a backbone for the development of Broad-XAI. Additionally, we recognise that RL methods have the ability to incorporate a range of technologies to allow agents to adapt to their environment. CXF is designed to incorporate many standard RL extensions and integrate with external ontologies and communication facilities so that the agent can answer questions that explain outcomes and justify its decisions.
