OpenAI Self-Supervised Learning Notes: Self-Supervised Learning | Tutorial | NeurIPS 2021
Reposted from a WeChat public account
Original link: https://mp.weixin.qq.com/s?__biz=Mzg4MjgxMjgyMg==&mid=2247486049&idx=1&sn=1d98375dcbb9d0d68e8733f2dd0a2d40&chksm=cf51b898f826318ead24e414144235cfd516af4abb71190aeca42b1082bd606df6973eb963f0#rd
OpenAI Self-Supervised Learning Notes
Table of Contents
- OpenAI Self-Supervised Learning Notes
- Outline
- Introduction
- What is self-supervised learning?
- What's Possible with Self-Supervised Learning?
- Early Work
- Early Work: Connecting the Dots
- Restricted Boltzmann Machines
- Autoencoder: Self-Supervised Learning for Vision in Early Days
- Word2Vec: Self-Supervised Learning for Language
- Autoregressive Modeling
- Siamese Networks
- Multiple Instance Learning & Metric Learning
- Methods
- Methods for Framing Self-Supervised Learning Tasks
- Self-Prediction
- Self-prediction: Autoregressive Generation
- Self-Prediction: Masked Generation
- Self-Prediction: Innate Relationship Prediction
- Self-Prediction: Hybrid Self-Prediction Models
- Contrastive Learning
- Contrastive Learning: Inter-Sample Classification
- Loss function 1: Contrastive loss
- Loss function 2: Triplet loss
- Loss function 3: N-pair loss
- Loss function 4: Lifted structured loss
- Loss function 5: Noise Contrastive Estimation (NCE)
- Loss function 6: InfoNCE
- Loss function 7: Soft-Nearest Neighbors Loss
- Contrastive Learning: Feature Clustering
- Contrastive Learning: Multiview Coding
- Contrastive Learning between Modalities
- Pretext tasks
- Recap: Pretext Tasks
- Pretext Tasks: Taxonomy
- Image / Vision Pretext Tasks
- Image Pretext Tasks: Variational AutoEncoders
- Image Pretext Tasks: Generative Adversarial Networks
- Vision Pretext Tasks: Autoregressive Image Generation
- Vision Pretext Tasks: Diffusion Model
- Vision Pretext Tasks: Masked Prediction
- Vision Pretext Tasks: Colorization and More
- Vision Pretext Tasks: Innate Relationship Prediction
- Contrastive Predictive Coding and InfoNCE
- Vision Pretext Tasks: Inter-Sample Classification
- Vision Pretext Tasks: Contrastive Learning
- Vision Pretext Tasks: Data Augmentation and Multiple Views
- Vision Pretext Tasks: Inter-Sample Classification
- MoCo
- SimCLR
- Barlow Twins
- Vision Pretext Tasks: Non-Contrastive Siamese Networks
- Vision Pretext Tasks: Feature Clustering with K-Means
- Vision Pretext Tasks: Feature Clustering with Sinkhorn-Knopp
- Vision Pretext Tasks: Feature Clustering to improve SSL
- Vision Pretext Tasks: Nearest-Neighbor
- Vision Pretext Tasks: Combining with Supervised Loss
- Video Pretext Tasks
- Video Pretext Tasks: Innate Relationship Prediction
- Video Pretext Tasks: Optical Flow
- Video Pretext Tasks: Sequence Ordering
- Video Pretext Tasks: Colorization
- Video Pretext Tasks: Contrastive Multi-View Learning
- Video Pretext Task: Autoregressive Generation
- Audio Pretext Tasks
- Audio Pretext Tasks: Contrastive Learning
- Audio Pretext Task: Masked Language Modeling for ASR
- Multimodal Pretext Tasks
- Language Pretext Tasks
- Language Pretext Tasks: Generative Language Modeling
- Language Pretext Tasks: Sentence Embedding
- Training Techniques
- Techniques: Data augmentation
- Techniques: Data augmentation -- Image Augmentation
- Techniques: Data augmentation -- Text Augmentation
- Hard Negative Mining
- What is "hard negative mining"
- Explicit hard negative mining
- Implicit hard negative mining
- Theories
- Contrastive learning captures shared information between views
- The InfoMin Principle
- Alignment and Uniformity on the Hypersphere
- Dimensional Collapse
- Provable Guarantees for Contrastive Learning
- Feature directions
- Future Directions
Video: https://www.youtube.com/watch?v=7l6fttRJzeU
Slides: https://nips.cc/media/neurips-2021/Slides/21895.pdf
Self-Supervised Learning
– Self-Prediction and Contrastive Learning
- Self-Supervised Learning
- a popular paradigm of representation learning
Outline
- Introduction: motivation, basic concepts, examples
- Early Work: a look at connections with older methods
- Methods
- Self-prediction
- Contrastive Learning
- (for each subsection, present the framework and categorization)
- Pretext tasks: a wide range of literature review
- Techniques: improve training efficiency
Introduction
What is self-supervised learning and why do we need it?
What is self-supervised learning?
- Self-supervised learning (SSL):
- a special type of representation learning that enables learning good data representations from unlabelled datasets
- Motivation:
the idea of constructing supervised learning tasks out of unsupervised datasets
Why?
✅ Data labeling is expensive, so high-quality labeled datasets are limited
✅ Learning good representations makes it easier to transfer useful information to a variety of downstream tasks ⇒ e.g. few-shot learning / zero-shot transfer to new tasks
Self-supervised learning tasks are also known as pretext tasks
What’s Possible with Self-Supervised Learning?
Video Colorization (Vondrick et al 2018)
a self-supervised learning method
resulting in a rich representation
can be used for video segmentation + unlabelled visual region tracking, without extra fine-tuning
just label the first frame
Zero-shot CLIP (Radford et al. 2021)
Despite not being trained on supervised labels,
the zero-shot CLIP classifier achieves strong performance on challenging image-to-text classification tasks
Early Work
Precursors to recent self-supervised approaches
Early Work: Connecting the Dots
Some ideas:
Restricted Boltzmann Machines
Autoencoders
Word2Vec
Autoregressive Modeling
Siamese networks
Multiple Instance / Metric Learning
Restricted Boltzmann Machines
- RBM:
a special case of Markov random field
consisting of visible units and hidden units
with connections between every pair of visible and hidden units, but none within each group
Autoencoder: Self-Supervised Learning for Vision in Early Days
- Autoencoder: a precursor to the modern self-supervised approaches
- such as the denoising autoencoder (a minimal sketch follows this subsection)
- Has inspired many self-supervised approaches in later years
- such as masked language models (e.g. BERT) and MAE
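Below is a minimal PyTorch sketch of the denoising-autoencoder idea mentioned above; the architecture, input size, and noise level are illustrative assumptions rather than details from the tutorial.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Denoising autoencoder: corrupt the input, then train the network to reconstruct
# the clean version. The representation lives in the encoder's hidden layer.
class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim=784, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x, noise_std=0.3):
        x_noisy = x + noise_std * torch.randn_like(x)  # corrupt the input
        return self.decoder(self.encoder(x_noisy))     # reconstruct from the corruption

model = DenoisingAutoencoder()
x = torch.rand(32, 784)         # a batch of flattened images
loss = F.mse_loss(model(x), x)  # target is the clean, uncorrupted input
loss.backward()
```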
Word2Vec: Self-Supervised Learning for Language
- Word embeddings map words to vectors
- extract features of words
- idea:
- the sum of the neighboring word embeddings is predictive of the word in the middle (see the sketch after this subsection)
- An interesting phenomenon resulting from Word2Vec:
you can observe linear substructure in the embedding space: the lines connecting comparable concepts, such as corresponding masculine and feminine words, are roughly parallel
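A minimal PyTorch sketch of that CBOW-style idea; vocabulary size, embedding dimension, and window size are arbitrary illustrative choices, and a full softmax is used for simplicity where real Word2Vec uses negative sampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# CBOW-style Word2Vec objective: the sum of the context word embeddings
# is used to predict the word in the middle.
vocab_size, dim = 10000, 100
embed = nn.Embedding(vocab_size, dim)   # word -> vector
out = nn.Linear(dim, vocab_size)        # predicts the center word

context = torch.randint(0, vocab_size, (32, 4))  # 4 neighboring words per example
center = torch.randint(0, vocab_size, (32,))     # the word in the middle

h = embed(context).sum(dim=1)                    # sum of neighboring embeddings
loss = F.cross_entropy(out(h), center)
loss.backward()
```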
Autoregressive Modeling
Autoregressive model:
Autoregressive (AR) models are a class of time series models in which the value at a given time step is modeled as a linear function of previous values (written out below)
NADE: Neural Autoregressive Distribution Estimator
Autoregressive models have also been the basis for many self-supervised methods such as GPT
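For reference, a standard AR(p) model (textbook form, not copied from the slides) can be written as:

```latex
x_t = c + \sum_{i=1}^{p} \varphi_i \, x_{t-i} + \varepsilon_t
```

where the \varphi_i are learned coefficients and \varepsilon_t is a noise term; neural autoregressive models such as NADE and GPT replace this linear function with a neural network.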
Siamese Networks
Many contrastive self-supervised learning methods use a pair of neural networks and learn from their differences
– this idea can be traced back to Siamese networks
- Self-organizing neural networks
- where two neural networks take separate but related parts of the input and learn to maximize the agreement between the two outputs
- Siamese Networks
if you believe that a network f can encode x well and produce a good representation f(x),
then, for two different inputs x1 and x2, their distance can be d(x1, x2) = L(f(x1), f(x2))
the idea of running two identical CNNs on two different inputs and then comparing them: a Siamese network
Train by (a minimal loss sketch follows):
✅ If xi and xj are the same person, ||f(xi) − f(xj)|| is small
✅ If xi and xj are different people, ||f(xi) − f(xj)|| is large
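A minimal PyTorch sketch of this pairwise training objective (the classic margin-based pair loss); the encoder, margin, and data shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Siamese training: one shared encoder f embeds both inputs; same-identity pairs are
# pulled together, different-identity pairs are pushed at least `margin` apart.
def siamese_pair_loss(f, x1, x2, same, margin=1.0):
    d = F.pairwise_distance(f(x1), f(x2))         # ||f(x1) - f(x2)||
    pos = same * d.pow(2)                         # same person: make the distance small
    neg = (1 - same) * F.relu(margin - d).pow(2)  # different: push beyond the margin
    return (pos + neg).mean()

f = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64))     # stand-in for the shared CNN
x1, x2 = torch.rand(8, 1, 28, 28), torch.rand(8, 1, 28, 28)
same = torch.randint(0, 2, (8,)).float()                    # 1 = same identity, 0 = different
loss = siamese_pair_loss(f, x1, x2, same)
loss.backward()
```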
Multiple Instance Learning & Metric Learning
Predecessors of the recent contrastive learning techniques: multiple instance learning and metric learning
They deviate from the typical framework of empirical risk minimization
- they define the objective function in terms of multiple samples from the dataset ⇒ multiple instance learning
Early work:
- centered around non-linear dimensionality reduction
- e.g., multi-dimensional scaling and locally linear embedding
- better than PCA: they can preserve the local structure of the data samples
Metric learning:
- x and y: two samples
- A: a learnable positive semi-definite matrix (the resulting distance is written out below)
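The learned metric referred to here is typically a Mahalanobis-style distance parameterized by A; stated for completeness (standard form, not copied from the slides):

```latex
d_A(x, y) = \sqrt{(x - y)^\top A \,(x - y)}, \qquad A \succeq 0
```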
Contrastive loss:
- uses a spring-system analogy to decrease the distance between inputs of the same type and increase it between inputs of different types
Triplet loss
- another way to obtain a learned metric
- defined using 3 data points
- anchor, positive and negative
- the anchor point is trained to become similar to the positive and dissimilar to the negative (the standard form is written out below)
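In its common form, with anchor a, positive p, negative n, embedding f, and margin m, the triplet loss reads:

```latex
\mathcal{L}_{\text{triplet}} = \max\bigl(0,\; \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + m\bigr)
```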
N-pair loss:
- a generalization of the triplet loss
- recent contrastive learning losses are modeled on the N-pair loss (one common form is given below)
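One common form of the N-pair loss (with anchor embedding f, one positive f^+, and N−1 negatives f_i^-) is:

```latex
\mathcal{L}_{\text{N-pair}} = \log\Bigl(1 + \sum_{i=1}^{N-1} \exp\bigl(f^\top f_i^- - f^\top f^+\bigr)\Bigr)
```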
Methods
- self-prediction
- Contrastive learning
Methods for Framing Self-Supervised Learning Tasks
- Self-prediction: Given an individual data sample, the task is to predict one part of the sample given the other part
- i.e., "intra-sample" prediction
The part to be predicted is treated as if it were missing
- Contrastive learning: Given multiple data samples, the task is to predict the relationship among them
the relationship can be based on the data's internal logic
✅ such as different camera views of the same scene
✅ or multiple augmented versions created from the same sample
The multiple samples can be selected from the dataset based on some known logic (e.g., the order of words / sentences), or fabricated by altering the original version
i.e., we know the true relationship between samples but pretend not to know it
Self-Prediction
Self-prediction constructs prediction tasks within every individual data sample:
to predict a part of the data from the rest while pretending we don't know that part
The following figure demonstrates how flexible and diverse the options are for constructing self-prediction learning tasks
✅ any dimensions can be masked
Categories:
- Autoregressive generation
- Masked generation
- Innate relationship prediction
- Hybrid self-prediction
Self-prediction: Autoregressive Generation
The autoregressive model predicts future behavior based on past behavior (a minimal sketch of the next-token objective follows the examples)
- Any data that comes with an innate sequential order can be modeled with autoregression
Examples :
- Audio (WaveNet, WaveRNN)
- Autoregressive language modeling (GPT, XLNet)
- Images in raster scan (PixelCNN, PixelRNN, iGPT)
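A minimal PyTorch sketch of the next-token objective shared by these examples; the tiny GRU merely stands in for a causal Transformer (GPT) or PixelCNN, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Autoregressive generation: predict element t from all elements before t.
vocab, dim = 256, 64
embed = nn.Embedding(vocab, dim)
backbone = nn.GRU(dim, dim, batch_first=True)   # stand-in for a causal Transformer / PixelCNN
head = nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (8, 33))        # e.g. audio samples, words, or raster-scan pixels
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one: predict the next element
h, _ = backbone(embed(inputs))
loss = F.cross_entropy(head(h).reshape(-1, vocab), targets.reshape(-1))
loss.backward()
```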
Self-Prediction: Masked Generation
Mask a random portion of the information and pretend it is missing, irrespective of the natural sequence (a minimal masked-prediction sketch follows the examples)
- The model learns to predict the missing portion given the other, unmasked information
e.g.,
- predicting random words based on other words in the same context around it
Examples :
- Masked language modeling (BERT)
- Images with masked patch (denoising autoencoder, context autoencoder, colorization)
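A minimal BERT-style masked-prediction sketch in PyTorch; the masking rate, model size, and mask id are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Masked generation: hide a random subset of positions and predict them from the rest.
vocab, dim, mask_id = 30000, 64, 0
embed = nn.Embedding(vocab, dim)
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)  # bidirectional context
head = nn.Linear(dim, vocab)

tokens = torch.randint(1, vocab, (8, 32))
mask = torch.rand(tokens.shape) < 0.15                # mask ~15% of positions, BERT-style
corrupted = tokens.masked_fill(mask, mask_id)         # replace them with a [MASK] id
logits = head(encoder(embed(corrupted)))
loss = F.cross_entropy(logits[mask], tokens[mask])    # loss only on the masked positions
loss.backward()
```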
Self-Prediction: Innate Relationship Prediction
Some transformations (e.g., segmentation, rotation) of a data sample should maintain the original information or follow a desired innate logic (a rotation-prediction sketch follows the examples)
Examples
Order of image patches
✅ e.g., shuffle the patches
✅ e.g., relative position, jigsaw puzzle
Image rotation
Counting features across patches
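A minimal sketch of the image-rotation pretext task listed above; the encoder and image sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Rotation prediction: rotate each image by 0/90/180/270 degrees and train a
# 4-way classifier to recognize which rotation was applied.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 4))

images = torch.rand(16, 3, 32, 32)
k = torch.randint(0, 4, (16,))  # rotation label per image (k * 90 degrees)
rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2)) for img, r in zip(images, k)])
loss = F.cross_entropy(encoder(rotated), k)
loss.backward()
```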
Self-Prediction: Hybrid Self-Prediction Models
Hybrid self-prediction models: combine different types of generative modeling
- VQ-VAE + AR
- Jukebox (Dhariwal et al. 2020), DALL-E (Ramesh et al. 2021)
- VQ-VAE + AR + Adversarial
VQGAN (Esser & Rombach et al. 2021)
VQ-VAE: learns a discrete codebook of context-rich visual parts
A transformer model: trained to autoregressively model the composition of codes from this codebook
Contrastive Learning
Goal:
To learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart
Contrastive learning can be applied to both supervised and unsupervised settings
- when working with unsupervised data, contrastive learning is one of the most powerful approaches in self-supervised learning
Categories
- Inter-sample classification (an InfoNCE sketch follows)
- Feature clustering
- Multiview coding
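A minimal in-batch sketch of the inter-sample classification view (an InfoNCE / NT-Xent-style loss, as used by methods like SimCLR); the temperature and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

# InfoNCE: for each sample, its other augmented view must be picked out ("classified")
# against all other samples in the batch.
def info_nce(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # cosine similarity of every pair in the batch
    targets = torch.arange(z1.size(0))   # the matching view is the "correct class"
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)  # embeddings of two views of 32 samples
loss = info_nce(z1, z2)
loss.backward()
```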