Reposted from a WeChat official account
Original link: https://mp.weixin.qq.com/s?__biz=Mzg4MjgxMjgyMg==&mid=2247486049&idx=1&sn=1d98375dcbb9d0d68e8733f2dd0a2d40&chksm=cf51b898f826318ead24e414144235cfd516af4abb71190aeca42b1082bd606df6973eb963f0#rd

OpenAI Self-Supervised Learning Notes


Table of Contents

  • OpenAI Self-Supervised Learning Notes
    • Outline
    • Introduction
      • What is self-supervised learning?
      • What's Possible with Self-Supervised Learning?
    • Early Work
      • Early Work: Connecting the Dots
      • Restricted Boltzmann Machines
      • Autoencoder: Self-Supervised Learning for Vision in Early Days
      • Word2Vec: Self-Supervised Learning for Language
      • Autoregressive Modeling
      • Siamese Networks
      • Multiple Instance Learning & Metric Learning
    • Methods
      • Methods for Framing Self-Supervised Learning Tasks
      • Self-Prediction
      • Self-prediction: Autoregressive Generation
      • Self-Prediction: Masked Generation
      • Self-Prediction: Innate Relationship Prediction
      • Self-Prediction: Hybrid Self-Prediction Models
      • Contrastive Learning
      • Contrastive Learning: Inter-Sample Classification
        • Loss function 1: Contrastive loss
        • Loss function 2: Triplet loss
        • Loss function 3: N-pair loss
        • Loss function 4: Lifted structured loss
        • Loss function 5: Noise Contrastive Estimation (NCE)
        • Loss function 6: InfoNCE
        • Loss function 7: Soft-Nearest Neighbors Loss
      • Contrastive Learning: Feature Clustering
      • Contrastive Learning: Multiview Coding
      • Contrastive Learning between Modalities
    • Pretext tasks
      • Recap: Pretext Tasks
      • Pretext Tasks: Taxonomy
      • Image / Vision Pretext Tasks
        • Image Pretext Tasks: Variational AutoEncoders
        • Image Pretext Tasks: Generative Adversarial Networks
        • Vision Pretext Tasks: Autoregressive Image Generation
        • Vision Pretext Tasks: Diffusion Model
        • Vision Pretext Tasks: Masked Prediction
        • Vision Pretext Tasks: Colorization and More
        • Vision Pretext Tasks: Innate Relationship Prediction
        • Contrastive Predictive Coding and InfoNCE
        • Vision Pretext Tasks: Inter-Sample Classification
        • Vision Pretext Tasks: Contrastive Learning
        • Vision Pretext Tasks: Data Augmentation and Multiple Views
        • Vision Pretext Tasks: Inter-Sample Classification
          • MoCo
          • SimCLR
          • Barlow Twins
        • Vision Pretext Tasks: Non-Contrastive Siamese Networks
        • Vision Pretext Tasks: Feature Clustering with K-Means
        • Vision Pretext Tasks: Feature Clustering with Sinkhorn-Knopp
        • Vision Pretext Tasks: Feature Clustering to improve SSL
        • Vision Pretext Tasks: Nearest-Neighbor
        • Vision Pretext Tasks: Combining with Supervised Loss
      • Video Pretext Tasks
        • Video Pretext Tasks: Innate Relationship Prediction
        • Video Pretext Tasks: Optical Flow
        • Video Pretext Tasks: Sequence Ordering
        • Video Pretext Tasks: Colorization
        • Video Pretext Tasks: Contrastive Multi-View Learning
        • Video Pretext Task: Autoregressive Generation
      • Audio Pretext Tasks
        • Audio Pretext Tasks: Contrastive Learning
        • Audio Pretext Task: Masked Language Modeling for ASR
      • Multimodal Pretext Tasks
      • Language Pretext Tasks
        • Language Pretext Tasks: Generative Language Modeling
        • Language Pretext Tasks: Sentence Embedding
    • Training Techniques
      • Techniques: Data augmentation
        • Techniques: Data augmentation -- Image Augmentation
        • Techniques: Data augmentation -- Text Augmentation
      • Hard Negative Mining
        • What is "hard negative mining"
        • Explicit hard negative mining
        • Implicit hard negative mining
    • Theories
      • Contrastive learning captures shared information between views
      • The InfoMin Principle
      • Alignment and Uniformity on the Hypersphere
      • Dimensional Collapse
      • Provable Guarantees for Contrastive Learning
    • Future Directions
      • Future Directions

Video: https://www.youtube.com/watch?v=7l6fttRJzeU
Slides: https://nips.cc/media/neurips-2021/Slides/21895.pdf

Self-Supervised Learning
– Self-Prediction and Contrastive Learning

  • Self-Supervised Learning

    • a popular paradigm of representation learning

Outline

  • Introduction: motivation, basic concepts, examples
  • Early Work: a look at connections with earlier methods
  • Methods
    • Self-prediction
    • Contrastive Learning
    • (for each subsection, present the framework and categorization)
  • Pretext tasks: a broad literature review
  • Techniques: improving training efficiency

Introduction

What is self-supervised learning and why do we need it?

What is self-supervised learning?

  • Self-supervised learning (SSL):

    • a special type of representation learning that enables learning good data representations from unlabelled datasets
  • Motivation:
    • the idea of constructing supervised learning tasks out of unsupervised datasets

    • Why?

      ✅ Data labeling is expensive, so high-quality labeled datasets are limited

      ✅ Learning good representations makes it easier to transfer useful information to a variety of downstream tasks ⇒ e.g. few-shot learning / zero-shot transfer to new tasks

Self-supervised learning tasks are also known as pretext tasks

What’s Possible with Self-Supervised Learning?

  • Video Colorization (Vondrick et al. 2018)

    • a self-supervised learning method

    • resulting in a rich representation

    • can be used for video segmentation + unlabelled visual region tracking, without extra fine-tuning

    • just label the first frame

  • Zero-shot CLIP (Radford et al. 2021)

    • Despite not being trained on supervised labels,

    • the zero-shot CLIP classifier achieves great performance on challenging image classification tasks

Early Work

Precursors to recent self-supervised approaches

Early Work: Connecting the Dots

Some ideas:

  • Restricted Boltzmann Machines

  • Autoencoders

  • Word2Vec

  • Autoregressive Modeling

  • Siamese networks

  • Multiple Instance / Metric Learning

Restricted Boltzmann Machines

  • RBM:

    • a special case of a Markov random field

    • consisting of visible units and hidden units

    • has connections between every pair of visible and hidden units, but no connections within each group (the standard energy function is sketched below)
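
For reference, the standard RBM energy function (notation assumed here, not taken from the slides), with visible units $v$, hidden units $h$, connection weights $W$, and biases $a$, $b$:

$$E(v, h) = -a^\top v - b^\top h - v^\top W h, \qquad p(v, h) = \frac{e^{-E(v, h)}}{Z}$$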

Autoencoder: Self-Supervised Learning for Vision in Early Days

  • Autoencoder: a precursor to the modern self-supervised approaches

    • Such as the Denoising Autoencoder (its objective is sketched below)
  • Has inspired many self-supervised approaches in later years
    • such as masked language models (e.g. BERT) and MAE
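
A minimal sketch of the denoising autoencoder objective (notation assumed here): the encoder $f$ and decoder $g$ are trained to reconstruct the clean input $x$ from a corrupted copy $\tilde{x}$:

$$\mathcal{L}_{\text{DAE}} = \mathbb{E}\big[\, \| x - g(f(\tilde{x})) \|^2 \,\big]$$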

Word2Vec: Self-Supervised Learning for Language

  • Word embeddings map words to vectors

    • extracting features of words
  • Idea:
    • the sum of the neighboring word embeddings is predictive of the word in the middle (see the sketch below)

  • An interesting phenomenon resulting from Word2Vec:

    • you can observe linear substructure in the embedding space: the lines connecting comparable concepts, such as corresponding masculine and feminine words, are roughly parallel
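
A sketch of the CBOW-style objective described above (notation assumed here, not from the slides): the averaged embeddings of the $2m$ context words predict the middle word $w_t$ through a softmax over the vocabulary:

$$p(w_t \mid \text{context}) = \operatorname{softmax}\Big(U \cdot \frac{1}{2m} \sum_{-m \le j \le m,\, j \ne 0} v_{w_{t+j}}\Big)$$

where $v_w$ are the input word embeddings and $U$ is the output embedding matrix.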

Autoregressive Modeling

  • Autoregressive model:

    • Autoregressive (AR) models are a class of time series models in which the value at a given time step is modeled as a linear function of previous values (see the formulation sketched after this list)

    • NADE: Neural Autoregressive Distribution Estimator

  • Autoregressive models have also been the basis for many self-supervised methods such as GPT
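
As a sketch of the classical formulation (notation assumed here), an AR(p) model writes the value at time $t$ as a linear function of the previous $p$ values plus noise:

$$x_t = c + \sum_{i=1}^{p} \varphi_i \, x_{t-i} + \varepsilon_t$$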

Siamese Networks

Many contrastive self-supervised learning methods use a pair of neural networks and learn from the difference between their outputs
– this idea can be traced back to Siamese networks

  • Self-organizing neural networks

    • where two neural networks take separate but related parts of the input, and learn to maximize the agreement between the two outputs
  • Siamese Networks
    • if you believe that a network f can encode x well and produce a good representation f(x)

    • then, for two different inputs $x_1$ and $x_2$, their distance can be defined as $d(x_1, x_2) = L(f(x_1), f(x_2))$

    • the idea of running two identical CNNs on two different inputs and then comparing their outputs is known as a Siamese network

    • Trained by (see the code sketch after this list):

      ✅ If $x_i$ and $x_j$ are the same person, $\|f(x_i) - f(x_j)\|$ is small

      ✅ If $x_i$ and $x_j$ are different people, $\|f(x_i) - f(x_j)\|$ is large
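
A minimal NumPy sketch of this pairwise training idea (function name, margin value, and embedding size are illustrative assumptions, not from the slides): the distance between embeddings is pulled down for matching pairs and pushed above a margin for non-matching pairs.

```python
import numpy as np

def pairwise_siamese_loss(f_xi, f_xj, same_person, margin=1.0):
    """Contrastive-style loss for one pair of embeddings f(x_i), f(x_j)."""
    d = np.linalg.norm(f_xi - f_xj)           # Euclidean distance ||f(x_i) - f(x_j)||
    if same_person:
        return d ** 2                         # pull matching pairs together
    return max(0.0, margin - d) ** 2          # push non-matching pairs at least `margin` apart

# Toy usage with random 128-d embeddings
rng = np.random.default_rng(0)
a, b = rng.normal(size=128), rng.normal(size=128)
print(pairwise_siamese_loss(a, b, same_person=True))
print(pairwise_siamese_loss(a, b, same_person=False))
```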

Multiple Instance Learning & Metric Learning

Predecessors of the predecessors of the recent contrastive learning techniques: multiple instance learning and metric learning

  • deviate from the typical framework of empirical risk minimization

    • define the objective function in terms of multiple samples from the dataset ⇒ multiple instance learning
  • early work:

    • centered around non-linear dimensionality reduction
    • e.g., multi-dimensional scaling and locally linear embedding
    • advantage over PCA: these methods can preserve the local structure of data samples
  • Metric learning (see the formulas sketched after this list):

    • $x$ and $y$: two samples
    • $A$: a learnable positive semi-definite matrix
  • Contrastive loss:

    • uses a spring-system analogy to decrease the distance between inputs of the same type and increase the distance between inputs of different types
  • Triplet loss

    • another way to obtain a learned metric
    • defined using 3 data points
    • anchor, positive and negative
    • the anchor is trained to be similar to the positive and dissimilar to the negative
  • N-pair loss:

    • a generalization of the triplet loss
    • recent contrastive learning methods use the N-pair loss as a prototype (see the loss sketches below)
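
For reference, a sketch of the learned metric and the two losses mentioned above, with notation assumed here (anchor $x$, positive $x^+$, negatives $x^-_i$, margin $\epsilon$) rather than taken from the slides:

$$d_A(x, y) = \sqrt{(x - y)^\top A \,(x - y)}, \qquad A \succeq 0$$

$$\mathcal{L}_{\text{triplet}} = \max\Big(0,\ \|f(x) - f(x^+)\|_2^2 - \|f(x) - f(x^-)\|_2^2 + \epsilon\Big)$$

$$\mathcal{L}_{\text{N-pair}} = -\log \frac{\exp\big(f(x)^\top f(x^+)\big)}{\exp\big(f(x)^\top f(x^+)\big) + \sum_{i=1}^{N-1} \exp\big(f(x)^\top f(x^-_i)\big)}$$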

Methods

  • Self-prediction
  • Contrastive learning

Methods for Framing Self-Supervised Learning Tasks

  • Self-prediction: Given an individual data sample, the task is to predict one part of the sample given the other part

    • i.e., "intra-sample" prediction

We pretend that the part to be predicted is missing

  • Contrastive learning: Given multiple data samples, the task is to predict the relationship among them

    • the relationship can be based on inherent logic within the data

      ✅ such as different camera views of the same scene

      ✅ or on multiple augmented versions of the same sample

The multiple samples can be selected from the dataset based on some known logic (e.g., the order of words / sentences), or fabricated by altering the original version
i.e., we know the true relationship between samples but pretend not to know it

Self-Prediction

  • Self-prediction constructs prediction tasks within every individual data sample

    • to predict a part of the data from the rest while pretending we don’t know that part

    • The following figure demonstrates how flexible and diverse the options for constructing self-prediction learning tasks are

      ✅ can mask any dimensions

  • Categories:

    • Autoregressive generation
    • Masked generation
    • Innate relationship prediction
    • Hybrid self-prediction

Self-prediction: Autoregressive Generation

  • The autoregressive model predicts future behavior based on past behavior

    • Any data that comes with an innate sequential order can be modeled with autoregression (see the factorization sketched after this list)
  • Examples:

    • Audio (WaveNet, WaveRNN)
    • Autoregressive language modeling (GPT, XLNet)
    • Images in raster scan order (PixelCNN, PixelRNN, iGPT)
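
As a sketch, the shared objective behind these examples (standard notation, assumed here): an ordered sequence $x = (x_1, \dots, x_T)$ is factorized autoregressively and trained with maximum likelihood:

$$p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}), \qquad \mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$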

Self-Prediction: Masked Generation

  • mask a random portion of the information and pretend it is missing, irrespective of the natural sequence

    • The model learns to predict the missing portion given the other, unmasked information (a minimal masking sketch follows this list)
  • e.g.,

    • predicting randomly masked words based on the other words in the same context
  • Examples:

    • Masked language modeling (BERT)
    • Images with masked patches (denoising autoencoder, context autoencoder, colorization)
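
A minimal Python sketch of the masking step (function name, mask token, and masking probability are illustrative assumptions, not from any specific paper): a random portion of the tokens is replaced with a placeholder, and only those positions become prediction targets.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Randomly replace tokens with a mask symbol; return masked inputs and targets."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # the model only sees the placeholder
            targets.append(tok)         # and is trained to predict the original token
        else:
            masked.append(tok)
            targets.append(None)        # no prediction loss on unmasked positions
    return masked, targets

inputs, targets = mask_tokens("the quick brown fox jumps over the lazy dog".split())
print(inputs)
print(targets)
```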

Self-Prediction: Innate Relationship Prediction

  • Some transformations (e.g., segmentation, rotation) of a data sample should maintain the original information or follow the desired innate logic (a rotation-prediction sketch follows this list)

  • Examples

    • Order of image patches

      ✅ e.g., shuffle the patches

      ✅ e.g., relative position, jigsaw puzzle

    • Image rotation

    • Counting features across patches
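
A minimal NumPy sketch of the rotation-prediction task mentioned above (function names are illustrative assumptions): each image is rotated by a random multiple of 90 degrees, and the rotation index serves as a free classification label.

```python
import numpy as np

def make_rotation_task(images, seed=0):
    """Rotate each image by k * 90 degrees; the rotation index k is a free label."""
    rng = np.random.default_rng(seed)
    rotated, labels = [], []
    for img in images:                    # img: H x W x C array
        k = int(rng.integers(0, 4))       # 0, 90, 180, or 270 degrees
        rotated.append(np.rot90(img, k))  # self-supervised input
        labels.append(k)                  # self-supervised target
    return rotated, labels

images = [np.zeros((32, 32, 3)) for _ in range(8)]
inputs, targets = make_rotation_task(images)
print(targets)
```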

Self-Prediction: Hybrid Self-Prediction Models

Hybrid self-prediction models: combine different types of generative modeling

  • VQ-VAE + AR

    • Jukebox (Dhariwal et al. 2020), DALL-E (Ramesh et al. 2021)
  • VQ-VAE + AR + Adversarial
    • VQGAN (Esser & Rombach et al. 2021)

    • VQ-VAE: learns a discrete codebook of context-rich visual parts

    • A transformer model: trained to autoregressively model the composition of codebook entries (a sketch of the two-stage recipe follows)
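
A sketch of the two-stage recipe described above (notation assumed here, not from the slides): the VQ-VAE encoder $E$ and quantizer $q$ map an input $x$ to a sequence of discrete codebook indices, and an autoregressive prior is then trained over those indices:

$$z = q\big(E(x)\big) \in \{1, \dots, K\}^{L}, \qquad p_\theta(z) = \prod_{t=1}^{L} p_\theta(z_t \mid z_{<t})$$

where $K$ is the codebook size and $L$ the number of discrete tokens per sample.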

Contrastive Learning