DeepDream Network

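This project uses the TensorFlow DeepDream tutorial to port Christian Dittmar and Stefan Balke's DeepDreamEffect, written for Caffe at HAMR 2015, to TensorFlow. In addition, the loss function that induces hallucinations in the convolutional graph was edited so that it ignores the high-energy regions of the spectrogram, which avoids distortion on resynthesis and adds musicality to the effect.

In essence, this hack converts an audio spectrogram into an image, processes it through a specific layer of a pretrained convolutional neural network (Inception v3 trained on ImageNet), and then resynthesizes the result into audio.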

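In 2014 Google trained a neural network for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and open-sourced it in July 2015.

The network learns a representation of each image. Lower layers learn low-level features such as lines and edges, while higher layers learn more complex patterns such as eyes, noses and mouths. As a result, if we try to visualize the higher-level features of the network, we see combinations of different features extracted from the original ImageNet images, for example a bird's eye and a dog's mouth.

With this in mind, if we take a new image and try to maximize its similarity to the higher layers of the network, the result is a new visual experience of that image, in which some of the patterns learned by the higher layers appear like a dream of the original. The figure below shows two examples of such imagined images: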

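1. Preparation

Download the pretrained Inception model from the web (https://github.com/martinwicke/tensorflow-tutorial/blob/master/tensorflow_inception_graph.pb).

2. Implementation in IPython

The original source code may not be compatible with TensorFlow 2.0, so the TF1 compatibility API is imported as tfc (import tensorflow.compat.v1 as tfc).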

from __future__ import print_function
import os
from io import BytesIO
import numpy as np
from functools import partial
import PIL.Image
from IPython.display import clear_output, Image, display, HTML
import IPython.display
import tensorflow as tf
import tensorflow.compat.v1 as tfc

3. Loading the pretrained graph and weights

model_fn = 'tensorflow_inception_graph.pb'

# Create a TensorFlow session and load the model
graph = tfc.Graph()
sess = tfc.InteractiveSession(graph=graph)
with tfc.gfile.FastGFile(model_fn, 'rb') as f:
    graph_def = tfc.GraphDef()
    graph_def.ParseFromString(f.read())
t_input = tfc.placeholder(np.float32, name='input')  # define the input tensor
imagenet_mean = 117.0
t_preprocessed = tf.expand_dims(t_input - imagenet_mean, 0)
tf.import_graph_def(graph_def, {'input': t_preprocessed})

4. Filter visualization

To understand the patterns the network has learned to recognize, we will try to generate images that maximize the sum of the activations of a particular filter in a particular convolutional layer. The network we study contains many convolutional layers, each of which outputs tens to hundreds of feature channels, so there are plenty of patterns to explore. Let's start with a naive approach: gradient ascent in image space!

# Pick an internal layer. Note that we use the outputs before the ReLU nonlinearity
# so that features with negative initial activations still have non-zero gradients.
layer = 'mixed4d_3x3_bottleneck_pre_relu'
channel = 139  # pick a feature channel to visualize

# Start with a gray image with a little noise
img_noise = np.random.uniform(size=(224,224,3)) + 100.0

def showarray(a, fmt='jpeg'):
    a = np.uint8(np.clip(a, 0, 1)*255)
    f = BytesIO()
    PIL.Image.fromarray(a).save(f, fmt)
    display(Image(data=f.getvalue()))

def visstd(a, s=0.1):
    '''Normalize the image range for visualization'''
    return (a-a.mean())/max(a.std(), 1e-4)*s + 0.5

def T(layer):
    '''Helper for getting layer output tensor'''
    return graph.get_tensor_by_name("import/%s:0"%layer)

def render_naive(t_obj, img0=img_noise, iter_n=20, step=1.0):
    img = img0.copy()
    # Optimization objective: maximize the mean activation of the chosen channel
    t_score = tf.reduce_mean(t_obj)
    t_grad = tf.gradients(t_score, t_input)[0]  # behold the power of automatic differentiation!
    for i in range(iter_n):
        g, score = sess.run([t_grad, t_score], {t_input:img})
        # Normalize the gradient so that the same step size works
        # for different layers and networks
        g /= g.std()+1e-8
        img += g*step
        print(score, end=' ')
        showarray(visstd(img))
    #clear_output()
    showarray(visstd(img))

render_naive(T(layer)[:,:,:,channel])

Output (score printed at each of the 20 iterations):

-19.692852 -28.732038 21.952808 83.23411 150.82277
216.71257 273.70526 324.83408 374.3172 431.97556
481.72006 520.34644 555.6622 592.3075 628.17377
661.4684 683.7716 710.0197 736.5903 754.4731
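The update rule used by render_naive can be seen in isolation with a toy model. The sketch below is only an illustration: it replaces the network with a single hypothetical linear filter w, so that the score and its gradient can be written by hand. The names toy_score, toy_grad and gradient_ascent are made up for this example and are not part of the project; only the normalized gradient step mirrors the code above.

```python
import numpy as np

def toy_score(img, w):
    """Mean response of a hypothetical linear 'filter' w to the image."""
    return float(np.mean(img * w))

def toy_grad(img, w):
    """Gradient of toy_score with respect to img (analytic for a linear filter)."""
    return w / img.size

def gradient_ascent(img0, w, iter_n=20, step=1.0):
    """Repeatedly move the image along the normalized gradient of the score."""
    img = img0.copy()
    scores = []
    for _ in range(iter_n):
        g = toy_grad(img, w)
        g = g / (g.std() + 1e-8)   # normalize so one step size suits any gradient scale
        img += g * step
        scores.append(toy_score(img, w))
    return img, scores

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8, 3))                # assumed stand-in for a learned filter
img0 = rng.uniform(size=(8, 8, 3)) + 100.0    # gray image with a little noise
img, scores = gradient_ascent(img0, w)
print(scores[0], scores[-1])                  # the score grows at every iteration
```

With a real network the gradient comes from automatic differentiation instead of a closed-form expression, but the loop has the same shape.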

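5. Multiscale image generation

We will apply gradient ascent at multiple scales. Details formed at a smaller scale are upscaled and augmented with additional detail at the next scale. With multiscale generation it is tempting to set the number of octaves to a high value in order to produce wallpaper-sized images. In that case, storing the network activations and backpropagation values quickly exhausts GPU memory. A simple trick avoids this: split the image into smaller tiles and compute the gradient of each tile independently. Applying a random shift to the image before each iteration helps avoid tile seams and improves the overall image quality.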
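The tiling trick can be sketched with NumPy alone. In the simplified example below, the hypothetical function tiled_apply stands in for the gradient computation: it shifts the image by a random offset, applies a per-tile function fn, and undoes the shift. Unlike the project's own helper, this sketch visits every tile including partial ones at the border, which is an assumption made to keep the example easy to check.

```python
import numpy as np

def tiled_apply(img, fn, tile_size=4, rng=None):
    """Apply fn tile by tile to a randomly shifted copy of img, then shift back."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    sx, sy = rng.integers(tile_size, size=2)           # random shift in [0, tile_size)
    shifted = np.roll(np.roll(img, sx, axis=1), sy, axis=0)
    out = np.zeros_like(shifted)
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = shifted[y:y + tile_size, x:x + tile_size]
            out[y:y + tile_size, x:x + tile_size] = fn(tile)
    # Undo the shift so the result lines up with the original image
    return np.roll(np.roll(out, -sx, axis=1), -sy, axis=0)

img = np.arange(10 * 7 * 3, dtype=np.float32).reshape(10, 7, 3)
result = tiled_apply(img, lambda tile: 2.0 * tile, rng=np.random.default_rng(0))
print(np.array_equal(result, 2.0 * img))
```

For a per-pixel function such as the doubling used here, the tiled result matches the untiled one. A network gradient depends on neighbouring pixels, so tile borders differ slightly, and the random shift moves those borders to a new place on each iteration.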

# tffunc and resize just use TensorFlow operations to resize an image img
def tffunc(*argtypes):
    '''Helper that transforms a TF-graph generating function into a regular one.
    See the "resize" function below.
    '''
    placeholders = list(map(tfc.placeholder, argtypes))  # one placeholder per argument type
    def wrap(f):
        out = f(*placeholders)
        def wrapper(*args, **kw):
            return out.eval(dict(zip(placeholders, args)), session=kw.get('session'))
        return wrapper
    return wrap

# Helper function that uses TF to resize an image
def resize(img, size):
    img = tf.expand_dims(img, 0)
    return tfc.image.resize_bilinear(img, size)[0,:,:,:]
resize = tffunc(np.float32, np.int32)(resize)

def calc_grad_tiled(img, t_grad, tile_size=512):
    '''Compute the value of tensor t_grad over the image in a tiled way.
    Random shifts are applied to the image to blur tile boundaries
    over multiple iterations.'''
    sz = tile_size
    h, w = img.shape[:2]
    sx, sy = np.random.randint(sz, size=2)
    img_shift = np.roll(np.roll(img, sx, axis=1), sy, axis=0)
    grad = np.zeros_like(img)
    for y in range(0, max(h-sz//2, sz), sz):
        for x in range(0, max(w-sz//2, sz), sz):
            sub = img_shift[y:y+sz,x:x+sz]
            g = sess.run(t_grad, {t_input:sub})
            grad[y:y+sz,x:x+sz] = g
    return np.roll(np.roll(grad, -sx, 1), -sy, 0)

def render_multiscale(t_obj, img0=img_noise, iter_n=10, step=1.0, octave_n=3, octave_scale=1.4):
    # Optimization objective: the mean of t_obj after masking. The mask is meant to
    # chop values within 20% of the max to 0. Note that it compares log(t_obj):
    # the log of a negative activation is NaN, which compares False, so negative
    # activations are masked out as well.
    t_obj_scaled = tf.multiply(t_obj, tfc.to_float(tfc.log(t_obj) < .8*tf.reduce_max(t_obj)))
    t_score = tf.reduce_mean(t_obj_scaled)
    t_grad = tf.gradients(t_score, t_input)[0]  # behold the power of automatic differentiation!

    img = img0.copy()
    for octave in range(octave_n):
        if octave>0:
            hw = np.float32(img.shape[:2])*octave_scale
            img = resize(img, np.int32(hw))
        for i in range(iter_n):
            g = calc_grad_tiled(img, t_grad)
            # Normalize the gradient so that the same step size works
            # for different layers and networks
            g /= g.std()+1e-8
            img += g*step
            print('.', end=' ')
        clear_output()
        showarray(visstd(img))

render_multiscale(T(layer)[:,:,:,channel])
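The comment in render_multiscale describes chopping values within 20% of the maximum to zero. A minimal NumPy sketch of that thresholding idea follows. It assumes the mask compares the activations themselves against 0.8 times their maximum, whereas the code above applies the comparison to the logarithm of the activations, so the two masks are not identical. The names mask_high_energy and masked_score are hypothetical.

```python
import numpy as np

def mask_high_energy(act, frac=0.8):
    """Zero every entry at or above frac * max(act); keep the rest unchanged."""
    keep = act < frac * act.max()
    return act * keep

def masked_score(act, frac=0.8):
    """Mean of the activations after the high-energy entries are zeroed."""
    return float(np.mean(mask_high_energy(act, frac)))

act = np.array([1.0, 2.0, 9.0, 10.0])   # toy activations; the threshold is 0.8 * 10 = 8
masked = mask_high_energy(act)
score = masked_score(act)
print(masked, score)
```

Because the masked entries contribute a constant zero to the score, they receive no gradient, so the optimization leaves the loudest regions alone.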

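7. Laplacian pyramid gradient normalization

This looks better, but the resulting image consists mostly of high frequencies. Can we improve it? One way is to add a smoothness prior to the optimization objective. This effectively blurs the image a little at every iteration, suppressing the higher frequencies so that the lower frequencies can catch up. However, it requires more iterations to produce a good image. Why not boost the low frequencies of the gradient directly instead? One way to do this is a Laplacian pyramid decomposition. We call the resulting technique Laplacian pyramid gradient normalization.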

k = np.float32([1,4,6,4,1])
k = np.outer(k, k)
k5x5 = k[:,:,None,None]/k.sum()*np.eye(3, dtype=np.float32)

def lap_split(img):
    '''Split the image into lo (low) and hi (high) frequency components'''
    with tf.name_scope('split'):
        lo = tf.nn.conv2d(img, k5x5, [1,2,2,1], 'SAME')
        lo2 = tf.nn.conv2d_transpose(lo, k5x5*4, tf.shape(img), [1,2,2,1])
        hi = img-lo2
    return lo, hi

def lap_split_n(img, n):
    '''Build Laplacian pyramid with n splits'''
    levels = []
    for i in range(n):
        img, hi = lap_split(img)
        levels.append(hi)
    levels.append(img)
    return levels[::-1]

def lap_merge(levels):
    '''Merge Laplacian pyramid'''
    img = levels[0]
    for hi in levels[1:]:
        with tf.name_scope('merge'):
            img = tf.nn.conv2d_transpose(img, k5x5*4, tf.shape(hi), [1,2,2,1]) + hi
    return img

def normalize_std(img, eps=1e-10):
    '''Normalize image by making its standard deviation = 1.0'''
    with tf.name_scope('normalize'):
        std = tf.sqrt(tf.reduce_mean(tf.square(img)))
        return img/tf.maximum(std, eps)

def lap_normalize(img, scale_n=4):
    '''Perform the Laplacian pyramid normalization.'''
    img = tf.expand_dims(img,0)
    tlevels = lap_split_n(img, scale_n)
    tlevels = list(map(normalize_std, tlevels))
    out = lap_merge(tlevels)
    return out[0,:,:,:]

'''
# Showing the lap_normalize graph with TensorBoard
lap_graph = tf.Graph()
with lap_graph.as_default():
    lap_in = tf.placeholder(np.float32, name='lap_in')
    lap_out = lap_normalize(lap_in)
show_graph(lap_graph)
'''
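To see what the pyramid does without TensorFlow, here is a small one-dimensional NumPy sketch. It is an illustration under simplifying assumptions rather than a port of the functions above: it works on a 1-D signal, uses the same binomial weights [1, 4, 6, 4, 1], and the names lap_split_1d, lap_merge_1d and lap_normalize_1d are made up for this example.

```python
import numpy as np

K = np.float32([1, 4, 6, 4, 1]) / 16.0   # 1-D version of the binomial blur kernel

def upsample_1d(lo, n):
    """Zero-stuff lo to length n and interpolate the gaps with the blur kernel."""
    up = np.zeros(n, dtype=lo.dtype)
    up[::2] = lo
    return np.convolve(up, 2 * K, mode="same")

def lap_split_1d(x):
    """Split x into a half-length low band and a full-length high band."""
    lo = np.convolve(x, K, mode="same")[::2]   # blur, then keep every 2nd sample
    hi = x - upsample_1d(lo, len(x))           # what the low band cannot explain
    return lo, hi

def lap_merge_1d(lo, hi):
    """Inverse of lap_split_1d."""
    return upsample_1d(lo, len(hi)) + hi

def lap_normalize_1d(x, scale_n=3, eps=1e-10):
    """Give every pyramid level unit RMS, then merge the levels back together."""
    levels = []
    for _ in range(scale_n):
        x, hi = lap_split_1d(x)
        levels.append(hi)
    levels.append(x)
    levels = [lv / max(np.sqrt(np.mean(lv ** 2)), eps) for lv in levels[::-1]]
    out = levels[0]
    for hi in levels[1:]:
        out = lap_merge_1d(out, hi)
    return out

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0.0, 6.28, 64)) + 0.1 * rng.normal(size=64)
lo, hi = lap_split_1d(x)
print(np.allclose(lap_merge_1d(lo, hi), x))   # split followed by merge is lossless
normalized = lap_normalize_1d(x)
```

Since every level is rescaled to the same RMS before merging, weak low-frequency components are boosted relative to strong high-frequency ones, which is the effect the section describes for gradients.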

8. Visualization

def render_lapnorm(t_obj, img0=img_noise, visfunc=visstd,
                   iter_n=10, step=1.0, octave_n=3, octave_scale=1.4, lap_n=4):
    t_obj_scaled = tf.multiply(t_obj, tfc.to_float(tfc.log(t_obj) < .8*tf.reduce_max(t_obj)))
    t_score = tf.reduce_mean(t_obj_scaled)  # defining the optimization objective
    t_grad = tf.gradients(t_score, t_input)[0]  # behold the power of automatic differentiation!
    # build the laplacian normalization graph
    lap_norm_func = tffunc(np.float32)(partial(lap_normalize, scale_n=lap_n))

    img = img0.copy()
    for octave in range(octave_n):
        if octave>0:
            hw = np.float32(img.shape[:2])*octave_scale
            img = resize(img, np.int32(hw))
        for i in range(iter_n):
            g = calc_grad_tiled(img, t_grad)
            g = lap_norm_func(g)
            img += g*step
            print('.', end=' ')
        clear_output()
        showarray(visfunc(img))

render_lapnorm(T(layer)[:,:,:,channel])

render_lapnorm(T(layer)[:,:,:,45]) #45 is polar bear? 59 is nice. 25 is ok.

render_lapnorm(T('mixed3b_1x1_pre_relu')[:,:,:,10])

9. DeepDream

def render_deepdream(t_obj, img0=img_noise,
                     iter_n=10, step=1.5, octave_n=16, octave_scale=1.4):
    t_obj_scaled = tf.multiply(t_obj, tfc.to_float(tfc.log(t_obj) < .8*tf.reduce_max(t_obj)))
    t_score = tf.reduce_mean(t_obj_scaled)  # defining the optimization objective
    t_grad = tf.gradients(t_score, t_input)[0]

    # split the image into a number of octaves
    img = img0.copy()
    octaves = []
    for i in range(octave_n-1):
        hw = img.shape[:2]
        lo = resize(img, np.int32(np.float32(hw)/octave_scale))
        hi = img-resize(lo, hw)
        img = lo
        octaves.append(hi)

    # generate details octave by octave
    for octave in range(octave_n):
        if octave>0:
            hi = octaves[-octave]
            img = resize(img, hi.shape[:2])+hi
        for i in range(iter_n):
            g = calc_grad_tiled(img, t_grad)
            img += g*(step / (np.abs(g).mean()+1e-7))
            print('.', end=' ')
        clear_output()
        showarray(img/255.0)
    return img/255.0
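The octave bookkeeping in render_deepdream (split the image into a small base plus per-scale detail layers, then add the details back while upscaling) can be checked on its own. The sketch below is a NumPy-only illustration: resize_nn is a hypothetical nearest-neighbour stand-in for the bilinear TensorFlow resize, and split_octaves and merge_octaves are names invented for this example.

```python
import numpy as np

def resize_nn(img, hw):
    """Nearest-neighbour resize; a simple stand-in for the bilinear TF resize."""
    h, w = img.shape[:2]
    new_h, new_w = int(hw[0]), int(hw[1])
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[rows][:, cols]

def split_octaves(img, octave_n=4, octave_scale=1.4):
    """Return the smallest image plus the detail lost at each downscaling step."""
    octaves = []
    for _ in range(octave_n - 1):
        hw = img.shape[:2]
        lo = resize_nn(img, np.int32(np.float32(hw) / octave_scale))
        hi = img - resize_nn(lo, hw)      # detail that the smaller image cannot hold
        img = lo
        octaves.append(hi)
    return img, octaves

def merge_octaves(base, octaves):
    """Upscale step by step, adding the stored detail back at each size."""
    img = base
    for hi in reversed(octaves):
        img = resize_nn(img, hi.shape[:2]) + hi
    return img

rng = np.random.default_rng(0)
img = rng.uniform(0, 255, size=(40, 30, 3))
base, octaves = split_octaves(img)
rebuilt = merge_octaves(base, octaves)
print(np.allclose(rebuilt, img))
```

In render_deepdream the gradient steps are applied between the upscaling steps, so each scale contributes dreamed detail on top of the detail that was stored for it.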

Any image of your choice can be used as input:

img0 = PIL.Image.open('images/timg.jpg')
img0 = np.float32(img0)
showarray(img0/255.0)

The processed image:

a = render_deepdream(T('mixed3b_1x1_pre_relu')[:,:,:,1], img0, iter_n=5)

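10. Applying it to music

All the DeepDream helper functions are now in place. We can extract an audio spectrogram using the STFT. A spectrogram is simply a single-channel matrix (N×M×1), so if we replicate it across three channels (N×M×3) and scale its values to the 8-bit range (0-255), we can process it like any RGB image.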

#!/usr/bin/env python
import librosa
import librosa.display

# load the audio
y, sr = librosa.load('audio/thief-original.aiff', sr=44100)
%matplotlib inline

# do the STFT
nfft = 2048
hop = 256
y_stft = librosa.core.stft(y, n_fft=nfft, hop_length=hop, center=True)

# separate the magnitude and phase
y_stft_mag1, y_stft_ang = librosa.magphase(y_stft)

# scale the spectrogram so that its values correspond to 0-255 (8-bit RGB amplitude)
nonlin = 1.0/8.0
y_stft_mag = np.power(y_stft_mag1, nonlin)
y_stft_mag = np.flipud((1 - y_stft_mag/y_stft_mag.max()))
y_stft_mag_rgb = np.zeros([y_stft_mag.shape[0], y_stft_mag.shape[1], 3])
y_stft_mag_rgb[:, :, 0] = y_stft_mag
y_stft_mag_rgb[:, :, 1] = y_stft_mag
y_stft_mag_rgb[:, :, 2] = y_stft_mag
img = 255*y_stft_mag_rgb

# show the log-magnitude spectrogram of the original audio
librosa.display.specshow(data=np.log(np.abs(y_stft_mag1)), sr=sr, x_axis='time', y_axis='log')

# show the spectrogram as an image i.e. on a linear freq axis
showarray(img/255.0)
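The scaling steps above can be packaged together with their inverse, which makes it easy to confirm that no information is lost before any dreaming takes place. This is a NumPy-only sketch; the function names mag_to_image and image_to_mag are hypothetical, and the peak value is returned explicitly because the inverse needs it.

```python
import numpy as np

def mag_to_image(mag, nonlin=1.0 / 8.0):
    """Compress, invert, flip and replicate a magnitude spectrogram into 0-255 RGB."""
    comp = np.power(mag, nonlin)            # compress the dynamic range
    peak = comp.max()
    gray = np.flipud(1.0 - comp / peak)     # loud bins become dark, low frequencies at the bottom
    img = 255.0 * np.repeat(gray[:, :, None], 3, axis=2)
    return img, peak

def image_to_mag(img, peak, nonlin=1.0 / 8.0):
    """Undo mag_to_image: average the channels, flip, invert and expand."""
    gray = np.flipud(img / 255.0).mean(axis=2)
    return np.power((1.0 - gray) * peak, 1.0 / nonlin)

rng = np.random.default_rng(0)
mag = rng.uniform(0.01, 5.0, size=(6, 5))   # toy magnitudes standing in for an STFT
img, peak = mag_to_image(mag)
restored = image_to_mag(img, peak)
print(np.allclose(restored, mag))
```

The 1/8 power matters because raw STFT magnitudes span several orders of magnitude; without compression almost every pixel would land near one end of the 0-255 range.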

11. Using the image as DeepDream input

#dream_spec = render_deepdream(T(layer)[:,:,:,channel], img)
dream_spec = render_deepdream(T('mixed3b_1x1_pre_relu')[:,:,:,13], img)

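12. Spectrogram resynthesis

Now we resynthesize the spectrogram into audio (the time domain). The first thing to do is undo all the processing we applied when scaling the spectrogram tensor into RGB space.
Because we discarded the phase of the STFT in order to visualize the spectrogram, we also have to add it back in to produce a meaningful reconstruction.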

import soundfile as sf  # librosa.output.write_wav was removed in librosa 0.8

# undo the processing to bring the image back from 0-255 to the original scale
deepdream_out = np.flipud(dream_spec)
# y_stft_mag was overwritten with the normalized image earlier, so recompute
# the peak of the compressed magnitude that was used for normalization
mag_peak = np.power(y_stft_mag1, nonlin).max()
deepdream_out = (1 - deepdream_out) * mag_peak
deepdream_out = np.power(deepdream_out, 1/nonlin)
# flatten the three channels and normalize over the number of channels
deepdream_out = np.sum(deepdream_out, axis=2) / 3.0
# show the new log-spectrogram
librosa.display.specshow(np.log(np.abs(deepdream_out)), sr=sr, x_axis='time', y_axis='log')

# keep a magnitude-only copy
deepdream_out_orig = deepdream_out.copy()

# resynthesize and save without the phase added back in, if ya want :)
output = librosa.core.istft(deepdream_out_orig, hop_length=256, win_length=2048, center=True)
sf.write('thief_44100_unrefined.wav', output, sr)

# add back in the original phase
_, orig_stft_ang = librosa.magphase(y_stft)
deepdream_out = deepdream_out_orig * orig_stft_ang
# resynthesize with the original phase
output = librosa.core.istft(deepdream_out, hop_length=256, win_length=2048, center=True)
# save to disk
sf.write('helix_dreamed_44100.wav', output, sr)
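The phase step relies on the fact that a complex STFT factors into a non-negative magnitude and a unit-modulus phase term, which is the same form librosa.magphase returns. The NumPy sketch below shows that factorization on random data; split_mag_phase is a hypothetical helper, and the halved magnitude merely stands in for a processed spectrogram.

```python
import numpy as np

def split_mag_phase(stft):
    """Factor a complex matrix into magnitude and a unit-modulus phase term."""
    mag = np.abs(stft)
    phase = np.exp(1j * np.angle(stft))
    return mag, phase

rng = np.random.default_rng(0)
stft = rng.normal(size=(5, 4)) + 1j * rng.normal(size=(5, 4))   # toy complex "STFT"
mag, phase = split_mag_phase(stft)

rebuilt = mag * phase            # magnitude times phase gives back the original
new_mag = 0.5 * mag              # stand-in for a processed (dreamed) magnitude
processed = new_mag * phase      # new magnitudes, original phase
print(np.allclose(rebuilt, stft), np.allclose(np.abs(processed), new_mag))
```

Reusing the original phase is a pragmatic choice: the dreamed magnitudes no longer match it exactly, but it usually sounds far more coherent than inverting the magnitudes with zero phase.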

Project page: https://github.com/markostam/audio-deepdream-tf