AIGC在营销图片生成技术综述

基于文本生成素材

imagen

分析用户输入的文本并使用T5-XXL进行编码。嵌入在 AI 中的文本首先被转换为分辨率为64x64像素的小图像。Imagen进一步利用文本条件超分辨率扩散模型对图像进行64×64的上采样，然后这个图像继续增长并最终形成。

Imagen 的开发者谷歌研究的大脑团队表示，基于变压器和图像扩散模型，Imagen实现了前所未有的真实感。谷歌声称，对比其它模型，在图像保真度和图像-文本匹配方面，人类评估者更喜欢 Imagen。

不过，谷歌也表示，Imagen 是在从网络上抓取的数据集上进行训练的，虽然已经过滤了很多不良内容如色情图像、污秽语言等，但仍有大量不当的内容数据集，因此也会存在种族主义诽谤和有害的社会刻板印象。

StableDiffusion

从名字Stable Diffusion就可以看出，这个主要采用的扩散模型（Diffusion Model）。

简单来说，扩散模型就是去噪自编码器的连续应用，逐步生成图像的过程。

一般所言的扩散，是反复在图像中添加小的、随机的噪声。而扩散模型则与这个过程相反——将噪声生成高清图像。训练的神经网络通常为U-net。

不过因为模型是直接在像素空间运行，导致扩散模型的训练、计算成本十分昂贵。

基于这样的背景下，Stable Diffusion主要分两步进行。

首先，使用编码器将图像x压缩为较低维的潜在空间表示z（x）。

其中上下文（Context）y，即输入的文本提示，用来指导x的去噪。

它与时间步长t一起，以简单连接和交叉两种方式，注入到潜在空间表示中去。

随后在z（x）基础上进行扩散与去噪。换言之，就是模型并不直接在图像上进行计算，从而减少了训练时间、效果更好。

值得一提的是，Stable DIffusion的上下文机制非常灵活，y不光可以是图像标签，就是蒙版图像、场景分割、空间布局，也能够相应完成。

素材多样化生成

基于文本ps多样化

DreamBooth

用户可以给定3-5张自己随意拍摄的某一物体的图片，就能得到不同背景下的该物体的新颖再现，同时又保留了其关键特征。

当然，作者也表示，这种方法并不局限于某个模型，如果DALL·E2经过一些调整，同样能实现这样的功能。

具体到方法上，DreamBooth采用了给物体加上“特殊标识符”的方法。

也就是说，原本图像生成模型收到的指令只是一类物体，例如[cat]、[dog]等，但现在DreamBooth会在这类物体前加上一个特殊标识符，变成[V][物体类别]。

以下图为例，将用户上传的三张狗子照片和相应的类名（如“狗”）作为输入信息，得到一个经过微调的文本-图像扩散模型。

该扩散模型用“a [V] dog”来特指用户上传图片中的狗子，再把其带入文字描述中，生成特定的图像，其中[V]就是那个特殊标识符。

至于为什么不直接用[V]来指代整个[特定物体]？

作者表示，受限于输入照片的数量，模型无法很好地学习到照片中物体的整体特征，反而可能出现过拟合。

因此这里采用了微调的思路，整体上仍然基于AI已经学到的[物体类别]特征，再用[V]学到的特殊特征来修饰它。

以生成一只白色的狗为例，这里模型会通过[V]来学习狗的颜色（白色）、体型等个性化细节，加上模型在[狗]这个大的类别中学到的狗的共性，就能生成更多合理又不失个性的白狗的照片。

为了训练这个微调的文本-图像扩散模型，研究人员首先根据给定的文本描述生成低分辨率图像，这时生成的图像中狗子的形象是随机的。

然后再应用超分辨率的扩散模型进行替换，把随机图像换成用户上传的特定狗子。

instructPix2Pix

InstructPix2Pix整合了目前较为成熟的两个大规模预训练模型：语言模型GPT-3和文本图像生成模型Stable Diffusion，生成了一个专用于图像编辑训练的数据集，随后训练了一个条件引导型的扩散模型来完成这一任务。此外，InstructPix2Pix模型可以在几秒钟内快速完成图像编辑操作，这进一步提高了InstructPix2Pix的可用性和实用性。

InstructPix2Pix模型的整体构建流程分为两大部分：（1）生成一个专用于图像编辑任务的数据集。（2）使用生成的数据集训练一个条件扩散模型，该模型可以按照人类的指令对目标图像进行各种形式的编辑操作，例如替换物体、更改图像本身的风格、修改图像的背景环境等等。

作者在InstructPix2Pix中整合了两个大规模预训练模型：语言模型GPT-3和文本图像模型Stable Diffusion，同时利用这两个模型中蕴含的知识构建了一个多模态训练数据集，该数据集主要包含了由文本编辑指令和编辑前后对应图像构成的图像对。在构建过程中，作者首先从文本编辑指令出发，生成成对的图像描述。随后再根据这些描述生成对应的成对图像构成训练样本。

1. 生成成对的图像描述

在这一过程中，需要先给定一个图像文本描述，例如“一个女孩骑马的照片”，如上图（a）中所示，随后需要根据该文本描述生成一些合理的编辑指令，例如“让一个女孩骑龙”，更合理一点的描述为“一个女孩骑龙的照片”，这一操作可以通过GPT-3类似的文本大模型完成。需要注意的是，这些操作完全在文本域中进行，这样做可以生成大量的、多样性的编辑指令，同时能够保证图像变化和文本指令之间的对应关系。

具体来说，作者对GPT-3进行了专门的微调，首先收集了一个规模相对较小的人工编辑三元组数据集，三元组包含（1）输入的图像描述，（2）编辑指令，（3）输出的图像描述，数据集详细介绍如下表所示。

首先收集了700条图像描述样本，然后手动编写了编辑指令和输出图像描述，然后使用这700条样本对GPT-3模型进行微调，微调后的模型可以自行生成详细的训练样本，上表非常鲜明的展示了作者手动生成的样本和GPT-3随后生成样本的对比。

在得到成对的编辑指令后，作者使用文本图像模型Stable Diffusion将这两个文本提示（即编辑前和编辑后）转换为一对相应的图像，如上图（b）所示。然而这一过程仍然面临一个重大挑战：目前的文本到图像模型无法保证图像内容身份信息的一致性，即使在输入的条件提示变化非常小的情况下。

例如，我们为模型指定两个非常相似的文本提示：“一张猫的照片”和“一张黑猫的照片”，模型可能会产生两只截然不同的猫的图像，这对本文图像编辑的目的来说是不合理的。为了解决这一问题，作者想到使用这些成对数据来训练模型编辑图像，而不是遵循这些模型原本的生成模式去生成随机图像。

作者使用了最近新提出的Prompt-to-Prompt方法[3]来完成操作，该方法可以针对一个输入文本生成多代近似的图像，且这些图像彼此之间含有相同的身份信息，Prompt-to-Prompt通过在去噪过程中使用交互注意力权重来实现。

InstructPix2Pix的建模本质是从隐空间扩散模型（Latent Diffusion）演变而来，Latent Diffusion通过在带有编码器和解码器的预训练变分自动编码器的隐空间中运行来提高扩散模型的效率和质量。对于一个图像，扩散过程将噪声添加到编码的隐层向量中，产生一个噪声隐变量，其中噪声等级随时间步数而增加。然后训练一个网络，它可以预测在给定的图像条件和文本指令条件下添加到噪声隐变量中的噪声信息，然后通过以下目标函数来优化模型：

之前的工作[4]表明微调大型图像扩散模型往往比从头训练模型以完成图像翻译任务效果更好，尤其是在配对训练数据有限的情况下。因此，本文作者使用预训练的Stable Diffusion对模型进行初始化。为了赋予InstructPix2Pix图像编辑的能力，作者在模型的第一个卷积层中增加了额外的条件输入通道。

为了进一步提高图像生成效果以及模型对输入条件的遵循程度，作者在InstructPix2Pix中也引入了Classifier-free引导策略。Classifier-free扩散引导是一种权衡扩散模型生成的样本质量和多样性的方法。其中隐式分类器会将更高的可能性分配给条件，以提高生成图像的视觉质量并使采样图像更好地与输入条件相符合。Classifier-free引导的训练需要同时联合训练有条件和无条件去噪的扩散模型，并在推理时结合两个分数进行估计。

对于本文的任务，作者设计了一个评分网络，其中有两个条件：输入图像和文本指令。在训练过程中，使InstructPix2Pix能够针对两个或任一条件输入进行有条件或无条件去噪。为此，作者引入了两个指导尺度和，可以对其进行调整以权衡生成的样本与输入图像的遵循程度以及它们与编辑指令的遵循程度，评分网络的分数估计如下：

在下图的“将大卫变成半机械人”的例子中显示了这两个参数对生成样本的影响。控制与输入图像的相似性，而控制与编辑指令的一致性。

基于3d建模多样化

单图生成3d模型

Point-E

Point-E 不输出传统意义上的 3D 图像，它会生成点云，或空间中代表 3D 形状的离散数据点集。Point-E 中的 E 是「效率」的缩写，表示其比以前的 3D 对象生成方法更快。不过从计算的角度来看，点云更容易合成，但它们无法捕获对象的细粒度形状或纹理 —— 这是目前 Point-E 的一个关键限制。

为了解决这一问题，OpenAI 团队训练了一个额外的人工智能系统来将 Point-E 的点云转换为网格。

Point-E 架构及运行原理

在独立的网格生成模型之外，Point-E 主要由两个模型组成：文本到图像模型和图像到 3D 模型。文本到图像模型类似于 OpenAI 自家的 DALL-E 2 和 Stable Diffusion 等生成模型系统，在标记图像上进行训练以理解单词和视觉概念之间的关联。在图像生成之后，图像到 3D 模型被输入一组与 3D 对象配对的图像，训练出在两者之间有效转换的能力。

不是训练单个生成模型，直接生成以文本为条件的点云，而是将生成过程分为三个步骤。

首先，生成一个以文本标题为条件的综合视图。

接下来，生成⼀个基于合成视图的粗略点云（1,024 个点）。

最后，生成了⼀个以低分辨率点云和合成视图为条件的精细点云（4,096 个点）。

在数百万个3D模型上训练模型后，我们发现数据集的数据格式和质量差异很大，这促使我们开发各种后处理步骤，以确保更高的数据质量。

为了将所有的数据转换为⼀种通用格式，我们使用Blender从20个随机摄像机角度，将每个3D模型渲染为RGBAD图像（Blender支持多种3D格式，并带有优化的渲染引擎）。

对于每个模型，Blender脚本都将模型标准化为边界立方体，配置标准照明设置，最后使用Blender的内置实时渲染引擎，导出RGBAD图像。

然后，使用渲染将每个对象转换为彩色点云。首先，通过计算每个RGBAD图像中每个像素的点，来为每个对象构建⼀个密集点云。这些点云通常包含数十万个不均匀分布的点，因此我们还使用最远点采样，来创建均匀的4K点云。

通过直接从渲染构建点云，我们能够避免直接从3D网格中采样可能出现的各种问题，对模型中包含的点进行取样，或处理以不寻常的文件格式存储的三维模型。

最后，我们采用各种启发式方法，来减少数据集中低质量模型的频率。

首先，我们通过计算每个点云的SVD来消除平面对象，只保留那些最小奇异值高于某个阈值的对象。

接下来，我们通过CLIP特征对数据集进行聚类（对于每个对象，我们对所有渲染的特征进行平均）。

我们发现，一些集群包含许多低质量的模型类别，而其他集群则显得更加多样化或可解释。

我们将这些集群分到几个不同质量的bucket中，并使用所得bucket的加权混合作为我们的最终数据集。

多图合成3d模型

Nerf

NeRF 作为 ECCV2020 Best Paper Honorable Mention，影响力巨大。如今各大CV&CG会议中都是此类工作（又一个大坑）。这个 method 使用隐式表达，以 2d posed images 为监督完成 novel view synthesis。通俗来说，就是用多张 2d 图片隐式重建三维场景，其展示的生成效果让人十分震撼。

希望本文能让大家快速了解这项工作~

原文地址：NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
项目主页：NeRF: Neural Radiance Fields

pipeline

NeRF训练管线

首先NeRF的想法是将三维场景隐式存储在神经网络中，我们只需要通过输入一个相机位姿，就可以获得场景图片。NeRF将场景建模成一个连续的5D辐射场（其实感觉就可以理解为隐式的体素描述）。

已知场景中点的位置 (x,y,z) 和观察方向 (θ,ϕ) ，神经网络 FΘ 会输出一个 (c,σ) 表示该方向的自发光颜色 c 和该点体素密度 σ 。然后使用 classical volume rendering 的渲染方程：

C(r)=∫tntfT(t)σ(r(t))c(r(t),d)dt, where T(t)=exp⁡(−∫tntσ(r(s))ds)

The expected color C(r) of camera ray r(t) = o + td with near and far bounds tn and tf
The function T(t) denotes the accumulated transmittance along the ray from tn to t, i.e., the probability that the ray travels from tn to t without hitting any other particle.

我们就可以得到输入相机位姿条件下的视角图片，然后和 ground truth 做损失即可完成可微优化。

其实思路总结下来十分简单：

用 network 存体素信息：(x,y,z,θ,ϕ)→(c,σ)

然后用体素渲染方程获得生成视角图片：光线采样+积分

最后与原视角图片计算损失更新网络

需要注意的是：体素在不同方向的自发光颜色是不一样的（view-dependent），具体效果参考原文 Fig.3 和 Fig.4 。view-dependent 可以表示各向异性的光学属性。

See Fig. 3 for an example of how our method uses the input viewing direction to represent non-Lambertian effects. As shown in Fig. 4, a model trained without view dependence (only x as input) has difficulty representing specularities.
一点疑问：如果体素密度 view-dependent 会怎么样？

Numerical estimation of volume rendering

在上一节我们介绍了NeRF生成图片的规则。注意到体素渲染方程是连续形式，而在实际神经网络训练过程中无法完成的。因此我们需要将其改为离散形式进行近似计算（Quadrature积分法）：

C^(r)=∑i=1NTi(1−exp⁡(−σiδi))ci, where Ti=exp⁡(−∑j=1i−1σjδj)

其中：

ti∼U[tn+i−1N(tf−tn),tn+iN(tf−tn)]

δi=ti+1−ti

通过将视线路径均分成 N 段，然后在每一段均匀地随机采样体素用于渲染计算。

文末给出了离散形式的证明

Hierarchical volume sampling

虽然前文我们使用离散的近似积分来进行体素渲染，但是在场景中难免会存在 free space 和 occluded regions 这种应该对渲染结果无贡献的区间，或者说没有意义的。为了更有效地采样，本文采用分级表征渲染的思想[1]提高渲染效率，即通过同时优化两个神经网络："coarse"和"fine"。

作者先使用分层采样得到 Nc 个点，通过 coarse 的渲染方程的计算：

C^c(r)=∑i=1Ncwici,wi=Ti(1−exp⁡(−σiδi))

对 ωi 进行归一化 w^i=wi/∑j=1Ncwj 得到分段常数概率密度函数，然后通过逆变换采样（inverse transform sampling）获得 Nf 个点，添加至原 Nc 个点中用于 fine 渲染。

逆变换采样：在分布 p 的 CDF 值域上均匀采样与原分布 p 中的采样同分布

通过第二次采样，我们所得到的采样点则会更多得使用对颜色计算贡献更有意义的体素进行计算。

最后的损失函数如下：

L=∑r∈R[‖C^c(r)−C(r)‖22+‖C^f(r)−C(r)‖22]

Positional encoding

虽然前文描述的“隐式表示+体素渲染”十分美好，但是我们通过下图（No Position Encoding）可知它的生成图片十分模糊（可以理解为一种高频信息丢失）。前人工作[2]指出神经网络倾向于学习低频信息，而NeRF需要重建高清的场景（对场景overfitting），所以我们需要让模型关注场景的高频细节。于是我们将位置向量 x 和方向向量 d 转化为高频变量：

γ(p)=(sin⁡(20πp),cos⁡(20πp),⋯,sin⁡(2L−1πp),cos⁡(2L−1πp))

可以理解为，这种表示方法，即便两个点在原空间中的距离很近，很难分辨，但通过Positional encoding后，我们还是可以很轻松的分辨两个点！

In our experiments, we set L = 10 for γ(x) and L = 4 for γ(d).

然后将Positional encoding后的 (x,y,z) 和 (θ,ϕ) 作为输入就可以生成更加清晰的图片。

合图多样化

Demo

文本生成图片

from diffusers import StableDiffusionPipeline
import torchmodel_id = "dreamlike-art/dreamlike-photoreal-2.0"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")prompt = "blender 3d model, a stand full body cute fluen ant doll with blue and white eyes of technology style,bright cinematic lighting, gopro, fisheye lens  closeup,highly detailed, digital painting, artstation, concept art, smooth,  volumetric light"
image = pipe(prompt).images[0]image.save("./result.jpg")

" a ant doll , blue and white eyes,technology style "

prompt文本增加多样性

import argparse
import hashlib
import itertools
import math
import os
from pathlib import Path
from typing import Optionalimport torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch.utils.data import Datasetfrom accelerate import Accelerator
from accelerate.logging import get_logger
from accelerate.utils import set_seed
from diffusers import AutoencoderKL, DDPMScheduler, StableDiffusionPipeline, UNet2DConditionModel
from diffusers.optimization import get_scheduler
from huggingface_hub import HfFolder, Repository, whoami
from PIL import Image
from torchvision import transforms
from tqdm.auto import tqdm
from transformers import CLIPTextModel, CLIPTokenizerlogger = get_logger(__name__)def parse_args(input_args=None):parser = argparse.ArgumentParser(description="Simple example of a training script.")parser.add_argument("--pretrained_model_name_or_path",type=str,default=None,required=True,help="Path to pretrained model or model identifier from huggingface.co/models.",)parser.add_argument("--revision",type=str,default=None,required=False,help="Revision of pretrained model identifier from huggingface.co/models.",)parser.add_argument("--tokenizer_name",type=str,default=None,help="Pretrained tokenizer name or path if not the same as model_name",)parser.add_argument("--instance_data_dir",type=str,default=None,required=True,help="A folder containing the training data of instance images.",)parser.add_argument("--class_data_dir",type=str,default=None,required=False,help="A folder containing the training data of class images.",)parser.add_argument("--instance_prompt",type=str,default=None,help="The prompt with identifier specifying the instance",)parser.add_argument("--class_prompt",type=str,default=None,help="The prompt to specify images in the same class as provided instance images.",)parser.add_argument("--with_prior_preservation",default=False,action="store_true",help="Flag to add prior preservation loss.",)parser.add_argument("--prior_loss_weight", type=float, default=1.0, help="The weight of prior preservation loss.")parser.add_argument("--num_class_images",type=int,default=100,help=("Minimal class images for prior preservation loss. If not have enough images, additional images will be"" sampled with class_prompt."),)parser.add_argument("--output_dir",type=str,default="text-inversion-model",help="The output directory where the model predictions and checkpoints will be written.",)parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.")parser.add_argument("--resolution",type=int,default=512,help=("The resolution for input images, all the images in the train/validation dataset will be resized to this"" resolution"),)parser.add_argument("--center_crop", action="store_true", help="Whether to center crop images before resizing to resolution")parser.add_argument("--use_filename_as_label", action="store_true", help="Uses the filename as the image labels instead of the instance_prompt, useful for regularization when training for styles with wide image variance")parser.add_argument("--use_txt_as_label", action="store_true", help="Uses the filename.txt file's content as the image labels instead of the instance_prompt, useful for regularization when training for styles with wide image variance")parser.add_argument("--train_text_encoder", action="store_true", help="Whether to train the text encoder")parser.add_argument("--train_batch_size", type=int, default=4, help="Batch size (per device) for the training dataloader.")parser.add_argument("--sample_batch_size", type=int, default=4, help="Batch size (per device) for sampling images.")parser.add_argument("--num_train_epochs", type=int, default=1)parser.add_argument("--max_train_steps",type=int,default=None,help="Total number of training steps to perform.  If provided, overrides num_train_epochs.",)parser.add_argument("--gradient_accumulation_steps",type=int,default=1,help="Number of updates steps to accumulate before performing a backward/update pass.",)parser.add_argument("--gradient_checkpointing",action="store_true",help="Whether or not to use gradient checkpointing to save memory at the expense of slower backward pass.",)parser.add_argument("--learning_rate",type=float,default=5e-6,help="Initial learning rate (after the potential warmup period) to use.",)parser.add_argument("--scale_lr",action="store_true",default=False,help="Scale the learning rate by the number of GPUs, gradient accumulation steps, and batch size.",)parser.add_argument("--lr_scheduler",type=str,default="constant",help=('The scheduler type to use. Choose between ["linear", "cosine", "cosine_with_restarts", "polynomial",'' "constant", "constant_with_warmup"]'),)parser.add_argument("--lr_warmup_steps", type=int, default=500, help="Number of steps for the warmup in the lr scheduler.")parser.add_argument("--use_8bit_adam", action="store_true", help="Whether or not to use 8-bit Adam from bitsandbytes.")parser.add_argument("--adam_beta1", type=float, default=0.9, help="The beta1 parameter for the Adam optimizer.")parser.add_argument("--adam_beta2", type=float, default=0.999, help="The beta2 parameter for the Adam optimizer.")parser.add_argument("--adam_weight_decay", type=float, default=1e-2, help="Weight decay to use.")parser.add_argument("--adam_epsilon", type=float, default=1e-08, help="Epsilon value for the Adam optimizer")parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")parser.add_argument("--push_to_hub", action="store_true", help="Whether or not to push the model to the Hub.")parser.add_argument("--hub_token", type=str, default=None, help="The token to use to push to the Model Hub.")parser.add_argument("--hub_model_id",type=str,default=None,help="The name of the repository to keep in sync with the local `output_dir`.",)parser.add_argument("--logging_dir",type=str,default="logs",help=("[TensorBoard](https://www.tensorflow.org/tensorboard) log directory. Will default to"" *output_dir/runs/**CURRENT_DATETIME_HOSTNAME***."),)parser.add_argument("--log_with",type=str,default="tensorboard",choices=["tensorboard", "wandb"])parser.add_argument("--mixed_precision",type=str,default="no",choices=["no", "fp16", "bf16"],help=("Whether to use mixed precision. Choose""between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10.""and an Nvidia Ampere GPU."),)parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")parser.add_argument("--save_model_every_n_steps", type=int)parser.add_argument("--auto_test_model", action="store_true", help="Whether or not to automatically test the model after saving it")parser.add_argument("--test_prompt", type=str, default="A photo of a cat", help="The prompt to use for testing the model.")parser.add_argument("--test_prompts_file", type=str, default=None, help="The file containing the prompts to use for testing the model.example: test_prompts.txt, each line is a prompt")parser.add_argument("--test_negative_prompt", type=str, default="", help="The negative prompt to use for testing the model.")parser.add_argument("--test_seed", type=int, default=42, help="The seed to use for testing the model.")parser.add_argument("--test_num_per_prompt", type=int, default=1, help="The number of images to generate per prompt.")if input_args is not None:args = parser.parse_args(input_args)else:args = parser.parse_args()env_local_rank = int(os.environ.get("LOCAL_RANK", -1))if env_local_rank != -1 and env_local_rank != args.local_rank:args.local_rank = env_local_rankif args.instance_data_dir is None:raise ValueError("You must specify a train data directory.")if args.with_prior_preservation:if args.class_data_dir is None:raise ValueError("You must specify a data directory for class images.")if args.class_prompt is None:raise ValueError("You must specify prompt for class images.")return args# turns a path into a filename without the extension
def get_filename(path):return path.stemdef get_label_from_txt(path):txt_path = path.with_suffix(".txt") # get the path to the .txt fileif txt_path.exists():with open(txt_path, "r") as f:return f.read()else:return ""class DreamBoothDataset(Dataset):"""A dataset to prepare the instance and class images with the prompts for fine-tuning the model.It pre-processes the images and the tokenizes prompts."""def __init__(self,instance_data_root,instance_prompt,tokenizer,class_data_root=None,class_prompt=None,size=512,center_crop=False,use_filename_as_label=False,use_txt_as_label=False,):self.size = sizeself.center_crop = center_cropself.tokenizer = tokenizerself.instance_data_root = Path(instance_data_root)if not self.instance_data_root.exists():raise ValueError("Instance images root doesn't exists.")self.instance_images_path = list(self.instance_data_root.glob("*.jpg")) + list(self.instance_data_root.glob("*.png"))self.num_instance_images = len(self.instance_images_path)self.instance_prompt = instance_promptself.use_filename_as_label = use_filename_as_labelself.use_txt_as_label = use_txt_as_labelself._length = self.num_instance_imagesif class_data_root is not None:self.class_data_root = Path(class_data_root)self.class_data_root.mkdir(parents=True, exist_ok=True)self.class_images_path = list(self.class_data_root.glob("*.jpg")) + list(self.class_data_root.glob("*.png"))self.num_class_images = len(self.class_images_path)self._length = max(self.num_class_images, self.num_instance_images)self.class_prompt = class_promptelse:self.class_data_root = Noneself.image_transforms = transforms.Compose([transforms.Resize(size, interpolation=transforms.InterpolationMode.BILINEAR),transforms.CenterCrop(size) if center_crop else transforms.RandomCrop(size),transforms.ToTensor(),transforms.Normalize([0.5], [0.5]),])def __len__(self):return self._lengthdef __getitem__(self, index):example = {}path = self.instance_images_path[index % self.num_instance_images]prompt = get_filename(path) if self.use_filename_as_label else self.instance_promptprompt = get_label_from_txt(path) if self.use_txt_as_label else promptprint("prompt", prompt)instance_image = Image.open(path)if not instance_image.mode == "RGB":instance_image = instance_image.convert("RGB")example["instance_images"] = self.image_transforms(instance_image)example["instance_prompt_ids"] = self.tokenizer(prompt,padding="do_not_pad",truncation=True,max_length=self.tokenizer.model_max_length,).input_idsif self.class_data_root:class_image = Image.open(self.class_images_path[index % self.num_class_images])if not class_image.mode == "RGB":class_image = class_image.convert("RGB")example["class_images"] = self.image_transforms(class_image)example["class_prompt_ids"] = self.tokenizer(self.class_prompt,padding="do_not_pad",truncation=True,max_length=self.tokenizer.model_max_length,).input_idsreturn exampleclass PromptDataset(Dataset):"A simple dataset to prepare the prompts to generate class images on multiple GPUs."def __init__(self, prompt, num_samples):self.prompt = promptself.num_samples = num_samplesdef __len__(self):return self.num_samplesdef __getitem__(self, index):example = {}example["prompt"] = self.promptexample["index"] = indexreturn exampledef get_full_repo_name(model_id: str, organization: Optional[str] = None, token: Optional[str] = None):if token is None:token = HfFolder.get_token()if organization is None:username = whoami(token)["name"]return f"{username}/{model_id}"else:return f"{organization}/{model_id}"def test_model(folder, args):if args.test_prompts_file is not None:with open(args.test_prompts_file, "r") as f:prompts = f.read().splitlines()else:prompts = [args.test_prompt]test_path = os.path.join(folder, "test")if not os.path.exists(test_path):os.makedirs(test_path)print("Testing the model...")from diffusers import DDIMSchedulerdevice = torch.device("cuda" if torch.cuda.is_available() else "cpu")torch_dtype = torch.float16 if device.type == "cuda" else torch.float32pipeline = StableDiffusionPipeline.from_pretrained(folder,torch_dtype=torch_dtype,safety_checker=None,load_in_8bit=True,scheduler = DDIMScheduler(beta_start=0.00085,beta_end=0.012,beta_schedule="scaled_linear",clip_sample=False,set_alpha_to_one=False,),)pipeline.set_progress_bar_config(disable=True)pipeline.enable_attention_slicing()pipeline = pipeline.to(device)torch.manual_seed(args.test_seed)with torch.autocast('cuda'): for prompt in prompts:print(f"Generating test images for prompt: {prompt}")test_images = pipeline(prompt=prompt,width=512,height=512,negative_prompt=args.test_negative_prompt,num_inference_steps=30, num_images_per_prompt=args.test_num_per_prompt,).imagesfor index, image in enumerate(test_images):image.save(f"{test_path}/{prompt}_{index}.png")del pipelineif torch.cuda.is_available():torch.cuda.empty_cache()print(f"Test completed.The examples are saved in {test_path}")def save_model(accelerator, unet, text_encoder, args, step=None):unet = accelerator.unwrap_model(unet) text_encoder = accelerator.unwrap_model(text_encoder)if step == None:folder = args.output_direlse:folder = args.output_dir + "-Step-" + str(step)print("Saving Model Checkpoint...")print("Directory: " + folder)# Create the pipeline using using the trained modules and save it.if accelerator.is_main_process:pipeline = StableDiffusionPipeline.from_pretrained(args.pretrained_model_name_or_path,unet=unet,text_encoder=text_encoder,revision=args.revision,)pipeline.save_pretrained(folder)del pipelineif torch.cuda.is_available():torch.cuda.empty_cache()if args.auto_test_model:print("Testing Model...")test_model(folder, args)if args.push_to_hub:repo.push_to_hub(commit_message="End of training", blocking=False, auto_lfs_prune=True)def main(args):logging_dir = Path(args.logging_dir)accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps,mixed_precision=args.mixed_precision,log_with=args.log_with,logging_dir=logging_dir,)# Currently, it's not possible to do gradient accumulation when training two models with accelerate.accumulate# This will be enabled soon in accelerate. For now, we don't allow gradient accumulation when training two models.# TODO (patil-suraj): Remove this check when gradient accumulation with two models is enabled in accelerate.if args.train_text_encoder and args.gradient_accumulation_steps > 1 and accelerator.num_processes > 1:raise ValueError("Gradient accumulation is not supported when training the text encoder in distributed training. ""Please set gradient_accumulation_steps to 1. This feature will be supported in the future.")if args.seed is not None:set_seed(args.seed)if args.with_prior_preservation:class_images_dir = Path(args.class_data_dir)if not class_images_dir.exists():class_images_dir.mkdir(parents=True)cur_class_images = len(list(class_images_dir.iterdir()))if cur_class_images < args.num_class_images:torch_dtype = torch.float16 if accelerator.device.type == "cuda" else torch.float32pipeline = StableDiffusionPipeline.from_pretrained(args.pretrained_model_name_or_path,torch_dtype=torch_dtype,safety_checker=None,revision=args.revision,)pipeline.set_progress_bar_config(disable=True)num_new_images = args.num_class_images - cur_class_imageslogger.info(f"Number of class images to sample: {num_new_images}.")sample_dataset = PromptDataset(args.class_prompt, num_new_images)sample_dataloader = torch.utils.data.DataLoader(sample_dataset, batch_size=args.sample_batch_size)sample_dataloader = accelerator.prepare(sample_dataloader)pipeline.to(accelerator.device)for example in tqdm(sample_dataloader, desc="Generating class images", disable=not accelerator.is_local_main_process):images = pipeline(example["prompt"]).imagesfor i, image in enumerate(images):hash_image = hashlib.sha1(image.tobytes()).hexdigest()image_filename = class_images_dir / f"{example['index'][i] + cur_class_images}-{hash_image}.jpg"image.save(image_filename)del pipelineif torch.cuda.is_available():torch.cuda.empty_cache()# Handle the repository creationif accelerator.is_main_process:if args.push_to_hub:if args.hub_model_id is None:repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token)else:repo_name = args.hub_model_idrepo = Repository(args.output_dir, clone_from=repo_name)with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore:if "step_*" not in gitignore:gitignore.write("step_*\n")if "epoch_*" not in gitignore:gitignore.write("epoch_*\n")elif args.output_dir is not None:os.makedirs(args.output_dir, exist_ok=True)# Load the tokenizerif args.tokenizer_name:tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_name,revision=args.revision,)elif args.pretrained_model_name_or_path:tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path,subfolder="tokenizer",revision=args.revision,)# Load models and create wrapper for stable diffusiontext_encoder = CLIPTextModel.from_pretrained(args.pretrained_model_name_or_path,subfolder="text_encoder",revision=args.revision,)vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path,subfolder="vae",revision=args.revision,)unet = UNet2DConditionModel.from_pretrained(args.pretrained_model_name_or_path,subfolder="unet",revision=args.revision,)vae.requires_grad_(False)if not args.train_text_encoder:text_encoder.requires_grad_(False)if args.gradient_checkpointing:unet.enable_gradient_checkpointing()if args.train_text_encoder:text_encoder.gradient_checkpointing_enable()if args.scale_lr:args.learning_rate = (args.learning_rate * args.gradient_accumulation_steps * args.train_batch_size * accelerator.num_processes)# Use 8-bit Adam for lower memory usage or to fine-tune the model in 16GB GPUsif args.use_8bit_adam:try:import bitsandbytes as bnbexcept ImportError:raise ImportError("To use 8-bit Adam, please install the bitsandbytes library: `pip install bitsandbytes`.")optimizer_class = bnb.optim.AdamW8bitelse:optimizer_class = torch.optim.AdamWparams_to_optimize = (itertools.chain(unet.parameters(), text_encoder.parameters()) if args.train_text_encoder else unet.parameters())optimizer = optimizer_class(params_to_optimize,lr=args.learning_rate,betas=(args.adam_beta1, args.adam_beta2),weight_decay=args.adam_weight_decay,eps=args.adam_epsilon,)noise_scheduler = DDPMScheduler.from_config(args.pretrained_model_name_or_path, subfolder="scheduler")train_dataset = DreamBoothDataset(instance_data_root=args.instance_data_dir,instance_prompt=args.instance_prompt,class_data_root=args.class_data_dir if args.with_prior_preservation else None,class_prompt=args.class_prompt,tokenizer=tokenizer,size=args.resolution,center_crop=args.center_crop,use_filename_as_label=args.use_filename_as_label,use_txt_as_label=args.use_txt_as_label,)def collate_fn(examples):input_ids = [example["instance_prompt_ids"] for example in examples]pixel_values = [example["instance_images"] for example in examples]# Concat class and instance examples for prior preservation.# We do this to avoid doing two forward passes.if args.with_prior_preservation:input_ids += [example["class_prompt_ids"] for example in examples]pixel_values += [example["class_images"] for example in examples]pixel_values = torch.stack(pixel_values)pixel_values = pixel_values.to(memory_format=torch.contiguous_format).float()input_ids = tokenizer.pad({"input_ids": input_ids}, padding=True, return_tensors="pt").input_idsbatch = {"input_ids": input_ids,"pixel_values": pixel_values,}return batchtrain_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=args.train_batch_size, shuffle=True, collate_fn=collate_fn, num_workers=1)# Scheduler and math around the number of training steps.overrode_max_train_steps = Falsenum_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)if args.max_train_steps is None:args.max_train_steps = args.num_train_epochs * num_update_steps_per_epochoverrode_max_train_steps = Truelr_scheduler = get_scheduler(args.lr_scheduler,optimizer=optimizer,num_warmup_steps=args.lr_warmup_steps * args.gradient_accumulation_steps,num_training_steps=args.max_train_steps * args.gradient_accumulation_steps,)if args.train_text_encoder:unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(unet, text_encoder, optimizer, train_dataloader, lr_scheduler)else:unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(unet, optimizer, train_dataloader, lr_scheduler)weight_dtype = torch.float32if args.mixed_precision == "fp16":weight_dtype = torch.float16elif args.mixed_precision == "bf16":weight_dtype = torch.bfloat16# Move text_encode and vae to gpu.# For mixed precision training we cast the text_encoder and vae weights to half-precision# as these models are only used for inference, keeping weights in full precision is not required.vae.to(accelerator.device, dtype=weight_dtype)if not args.train_text_encoder:text_encoder.to(accelerator.device, dtype=weight_dtype)# We need to recalculate our total training steps as the size of the training dataloader may have changed.num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)if overrode_max_train_steps:args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch# Afterwards we recalculate our number of training epochsargs.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)# We need to initialize the trackers we use, and also store our configuration.# The trackers initializes automatically on the main process.if accelerator.is_main_process:accelerator.init_trackers("dreambooth", config=vars(args))# Train!total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_stepslogger.info("***** Running training *****")logger.info(f"  Num examples = {len(train_dataset)}")logger.info(f"  Num batches each epoch = {len(train_dataloader)}")logger.info(f"  Num Epochs = {args.num_train_epochs}")logger.info(f"  Instantaneous batch size per device = {args.train_batch_size}")logger.info(f"  Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")logger.info(f"  Gradient Accumulation steps = {args.gradient_accumulation_steps}")logger.info(f"  Total optimization steps = {args.max_train_steps}")# Only show the progress bar once on each machine.progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)progress_bar.set_description("Steps")global_step = 0for epoch in range(args.num_train_epochs):unet.train()if args.train_text_encoder:text_encoder.train()for step, batch in enumerate(train_dataloader):with accelerator.accumulate(unet):# Convert images to latent spacelatents = vae.encode(batch["pixel_values"].to(dtype=weight_dtype)).latent_dist.sample()latents = latents * 0.18215# Sample noise that we'll add to the latentsnoise = torch.randn_like(latents)bsz = latents.shape[0]# Sample a random timestep for each imagetimesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device)timesteps = timesteps.long()# Add noise to the latents according to the noise magnitude at each timestep# (this is the forward diffusion process)noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)# Get the text embedding for conditioningencoder_hidden_states = text_encoder(batch["input_ids"])[0]# Predict the noise residualnoise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sampleif args.with_prior_preservation:# Chunk the noise and noise_pred into two parts and compute the loss on each part separately.noise_pred, noise_pred_prior = torch.chunk(noise_pred, 2, dim=0)noise, noise_prior = torch.chunk(noise, 2, dim=0)# Compute instance lossloss = F.mse_loss(noise_pred.float(), noise.float(), reduction="none").mean([1, 2, 3]).mean()# Compute prior lossprior_loss = F.mse_loss(noise_pred_prior.float(), noise_prior.float(), reduction="mean")# Add the prior loss to the instance loss.loss = loss + args.prior_loss_weight * prior_losselse:loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="mean")accelerator.backward(loss)if accelerator.sync_gradients:params_to_clip = (itertools.chain(unet.parameters(), text_encoder.parameters())if args.train_text_encoderelse unet.parameters())accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)optimizer.step()lr_scheduler.step()optimizer.zero_grad()# Checks if the accelerator has performed an optimization step behind the scenesif accelerator.sync_gradients:progress_bar.update(1)global_step += 1logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0]}progress_bar.set_postfix(**logs)accelerator.log(logs, step=global_step)if global_step >= args.max_train_steps:breakif args.save_model_every_n_steps != None and (global_step % args.save_model_every_n_steps) == 0:save_model(accelerator, unet, text_encoder, args, global_step)accelerator.wait_for_everyone()save_model(accelerator, unet, text_encoder, args, step=None)accelerator.end_training()if __name__ == "__main__":args = parse_args()main(args)

# 用于训练特定物体/人物的方法（只需单一标签）
export MODEL_NAME="./model"
export INSTANCE_DIR="./datasets/test2"
export OUTPUT_DIR="./new_model"
export CLASS_DIR="./datasets/class" # 用于存放模型生成的先验知识的图片文件夹，请勿改动
export LOG_DIR="/root/tf-logs"
export TEST_PROMPTS_FILE="./test_prompts_object.txt"rm -rf $CLASS_DIR/* # 如果你要训练与上次不同的特定物体/人物，需要先清空该文件夹。其他时候可以注释掉这一行（前面加#）
rm -rf $LOG_DIR/*accelerate launch tools/train_dreambooth.py \--train_text_encoder \--pretrained_model_name_or_path=$MODEL_NAME  \--mixed_precision="fp16" \--instance_data_dir=$INSTANCE_DIR \--instance_prompt="a photo of <xxx> dog" \--with_prior_preservation --prior_loss_weight=1.0 \--class_prompt="a photo of dog" \--class_data_dir=$CLASS_DIR \--num_class_images=200 \--output_dir=$OUTPUT_DIR \--logging_dir=$LOG_DIR \--center_crop \--resolution=512 \--train_batch_size=1 \--gradient_accumulation_steps=1 --gradient_checkpointing \--use_8bit_adam \--learning_rate=2e-6 \--lr_scheduler="constant" \--lr_warmup_steps=0 \--auto_test_model \--test_prompts_file=$TEST_PROMPTS_FILE \--test_seed=123 \--test_num_per_prompt=3 \--max_train_steps=1000 \--save_model_every_n_steps=500 # 如果max_train_steps改大了，请记得把save_model_every_n_steps也改大
# 不然磁盘很容易中间就满了# 以下是核心参数介绍：
# 主要的几个
# --train_text_encoder 训练文本编码器
# --mixed_precision="fp16" 混合精度训练
# - center_crop
# 是否裁剪图片，一般如果你的数据集不是正方形的话，需要裁剪
# - resolution
# 图片的分辨率，一般是512，使用该参数会自动缩放输入图像
# 可以配合center_crop使用，达到裁剪成正方形并缩放到512*512的效果
# - instance_prompt
# 如果你希望训练的是特定的人物，使用该参数
# 如 --instance_prompt="a photo of <xxx> girl"
# - class_prompt
# 如果你希望训练的是某个特定的类别，使用该参数可能提升一定的训练效果
# - use_txt_as_label
# 是否读取与图片同名的txt文件作为label
# 如果你要训练的是整个大模型的图像风格，那么可以使用该参数
# 该选项会忽略instance_prompt参数传入的内容
# - learning_rate
# 学习率，一般是2e-6，是训练中需要调整的关键参数
# 太大会导致模型不收敛，太小的话，训练速度会变慢
# - lr_scheduler, 可选项有constant, linear, cosine, cosine_with_restarts, cosine_with_hard_restarts
# 学习率调整策略，一般是constant，即不调整，如果你的数据集很大，可以尝试其他的，但是可能会导致模型不收敛，需要调整学习率
# - lr_warmup_steps，如果你使用的是constant，那么这个参数可以忽略，
# 如果使用其他的，那么这个参数可以设置为0，即不使用warmup
# 也可以设置为其他的值，比如1000，即在前1000个step中，学习率从0慢慢增加到learning_rate的值
# 一般不需要设置, 除非你的数据集很大，训练收敛很慢
# - max_train_steps
# 训练的最大步数，一般是1000，如果你的数据集比较大，那么可以适当增大该值
# - save_model_every_n_steps
# 每多少步保存一次模型，方便查看中间训练的结果找出最优的模型，也可以用于断点续训# --with_prior_preservation，--prior_loss_weight=1.0，分别是使用先验知识保留和先验损失权重
# 如果你的数据样本比较少，那么可以使用这两个参数，可以提升训练效果，还可以防止过拟合（即生成的图片与训练的图片相似度过高）# --auto_test_model, --test_prompts_file, --test_seed, --test_num_per_prompt
# 分别是自动测试模型（每save_model_every_n_steps步后）、测试的文本、随机种子、每个文本测试的次数
# 测试的样本图片会保存在模型输出目录下的test文件夹中

from diffusers import StableDiffusionPipeline
import torch
from diffusers import DDIMSchedulermodel_path = "./new_model"
prompt = "a cute girl, blue eyes, brown hair"
torch.manual_seed(123123123)pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16,scheduler=DDIMScheduler(beta_start=0.00085,beta_end=0.012,beta_schedule="scaled_linear",clip_sample=False,set_alpha_to_one=True,),safety_checker=None)# def dummy(images, **kwargs):
#     return images, False
# pipe.safety_checker = dummy
pipe = pipe.to("cuda")
images = pipe(prompt, width=512, height=512, num_inference_steps=30, num_images_per_prompt=3).images
for i, image in enumerate(images):image.save(f"test-{i}.png")

Point-E单图建模增加多样性

blender修改颜色、光照、纹理

InstructPix2Pix文本增加多样性效果

参考：

[1]https://imagen.research.google/

[2]https://twitter.com/ai__pub/status/1561362542487695360

[3]https://stability.ai/blog/stable-diffusion-announcement

[4]https://arxiv.org/abs/2112.10752

[5]https://dreambooth.github.io/

[6]https://twitter.com/natanielruizg/status/1563166568195821569

[7]https://natanielruiz.github.io/

[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

[9] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[10] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel CohenOr. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.

[11] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952, 2022.

[12]https://techcrunch.com/2022/12/20/openai-releases-point-e-an-ai-that-generates-3d-models/?tpcc=tcplustwitter

[13]https://www.engadget.com/openai-releases-point-e-dall-e-3d-text-modeling-210007892.html?src=rss