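[Speech Signal Processing] 3. Speech Signal Visualization: Prosody
Speech Feature Extraction and Visualization
- 1. Preparation
- 2. Final Result
- 3. Code
1. Preparation
This post adds prosody-related feature extraction to the earlier work. For the previous part, see: [Speech Signal Processing] 1. Speech Signal Visualization: Time Domain, Frequency Domain, Spectrogram, Detailed MFCC Reasoning and Computation, Deltas
Install this library: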
pip install praat-parselmouth
A few other packages are needed as well; install whichever ones are missing:
pip install pydub
pip install python_speech_features
pip install librosa==0.9
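To see which packages are still missing before running the script, a check along the following lines may help. It is only a sketch: the helper name `missing_packages` and the `REQUIRED` table are made up for illustration, and the import names are the ones the script below uses.

```python
import importlib.util

# pip package name -> module name imported by the script (assumed mapping).
REQUIRED = {
    "praat-parselmouth": "parselmouth",
    "pydub": "pydub",
    "python_speech_features": "python_speech_features",
    "librosa": "librosa",
    "numpy": "numpy",
    "scipy": "scipy",
    "matplotlib": "matplotlib",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be located."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Print a ready-to-paste install command for whatever is absent.
        print("pip install " + " ".join(missing))
    else:
        print("all packages found")
```

Note that this only checks that each module can be found; it does not check versions, so the `librosa==0.9` pin above still has to be respected by hand.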
2. Final Result
3. Code
"""
This script contains supporting function for the data processing.
It is used in several other scripts:
for generating bvh files, aligning sequences and calculation of speech features
"""
import librosa
import librosa.display
from pydub import AudioSegment  # TODO(RN) add dependency!
import parselmouth as pm  # TODO(RN) add dependency!
from python_speech_features import mfcc
import scipy.io.wavfile as wav
import numpy as np
import scipy

NFFT = 1024
MFCC_INPUTS = 26  # How many features we will store for each MFCC vector
WINDOW_LENGTH = 0.1  # s
SUBSAMPL_RATE = 9


def derivative(x, f):
    """ Calculate numerical derivative (by FDM) of a 1d array
    Args:
        x: input space x
        f: Function of x
    Returns:
        der: numerical derivative of f wrt x
    """
    x = 1000 * x  # from seconds to milliseconds

    # Normalization:
    dx = (x[1] - x[0])
    cf = np.convolve(f, [1, -1]) / dx

    # Remove unstable values
    der = cf[:-1].copy()
    der[0] = 0
    return der


def create_bvh(filename, prediction, frame_time):
    """
    Create BVH File
    Args:
        filename: file, in which motion in bvh format should be written
        prediction: motion sequences, to be written into file
        frame_time: frame rate of the motion
    Returns:
        nothing, writes motion to the file
    """
    with open('hformat.txt', 'r') as ftemp:
        hformat = ftemp.readlines()

    with open(filename, 'w') as fo:
        prediction = np.squeeze(prediction)
        print("output vector shape: " + str(prediction.shape))
        offset = [0, 60, 0]
        offset_line = "\tOFFSET " + " ".join("{:.6f}".format(x) for x in offset) + '\n'
        fo.write("HIERARCHY\n")
        fo.write("ROOT Hips\n")
        fo.write("{\n")
        fo.write(offset_line)
        fo.writelines(hformat)
        fo.write("MOTION\n")
        fo.write("Frames: " + str(len(prediction)) + '\n')
        # str() so that a numeric frame_time does not raise a TypeError
        fo.write("Frame Time: " + str(frame_time) + "\n")
        for row in prediction:
            row[0:3] = 0
            legs = np.zeros(24)
            row = np.concatenate((row, legs))
            label_line = " ".join("{:.6f}".format(x) for x in row) + " "
            fo.write(label_line + '\n')
        print("bvh generated")


def shorten(arr1, arr2, min_len=0):
    if min_len == 0:
        min_len = min(len(arr1), len(arr2))
    arr1 = arr1[:min_len]
    arr2 = arr2[:min_len]
    return arr1, arr2


def average(arr, n):
    """ Replace every "n" values by their average
    Args:
        arr: input array
        n: number of elements to average on
    Returns:
        resulting array
    """
    end = n * int(len(arr) / n)
    return np.mean(arr[:end].reshape(-1, n), 1)


def calculate_spectrogram(audio_filename):
    """ Calculate spectrogram for the audio file
    Args:
        audio_filename: audio file name
    Returns:
        log spectrogram values
    """
    DIM = 64
    audio, sample_rate = librosa.load(audio_filename)

    # Make stereo audio being mono
    # (librosa.load already returns mono by default, so this is only a safeguard)
    if len(audio.shape) == 2:
        audio = (audio[:, 0] + audio[:, 1]) / 2

    # The original passed window=scipy.signal.hanning, which is no longer
    # available in recent SciPy releases; the "hann" string selects a Hann window.
    spectr = librosa.feature.melspectrogram(y=audio, sr=sample_rate, window="hann",
                                            # win_length=int(WINDOW_LENGTH * sample_rate),
                                            hop_length=int(WINDOW_LENGTH * sample_rate / 2),
                                            fmax=7500, fmin=100, n_mels=DIM)

    # Shift into the log scale
    eps = 1e-10
    log_spectr = np.log(abs(spectr) + eps)
    return np.transpose(log_spectr)


def calculate_mfcc(audio_filename):
    """
    Calculate MFCC features for the audio in a given file
    Args:
        audio_filename: file name of the audio
    Returns:
        feature_vectors: MFCC feature vector for the given audio file
    """
    fs, audio = wav.read(audio_filename)

    # Make stereo audio being mono
    if len(audio.shape) == 2:
        audio = (audio[:, 0] + audio[:, 1]) / 2

    # Calculate MFCC feature with the window frame it was designed for
    input_vectors = mfcc(audio, winlen=0.02, winstep=0.01, samplerate=fs,
                         numcep=MFCC_INPUTS, nfft=NFFT)
    input_vectors = [average(input_vectors[:, i], 5) for i in range(MFCC_INPUTS)]
    feature_vectors = np.transpose(input_vectors)
    return feature_vectors


def extract_prosodic_features(audio_filename):
    """
    Extract the prosodic features
    Args:
        audio_filename: file name for the audio to be used
    Returns:
        pros_feature: energy, energy_der, pitch, pitch_der
                      (pitch_ind is currently disabled)
    """
    WINDOW_LENGTH = 5  # ms; local value, shadows the module-level constant

    # Read audio from file
    sound = AudioSegment.from_file(audio_filename, format="wav")

    # Alternative prosodic features
    pitch, energy = compute_prosody(audio_filename, WINDOW_LENGTH / 1000)
    duration = len(sound) / 1000
    t = np.arange(0, duration, WINDOW_LENGTH / 1000)
    energy_der = derivative(t, energy)
    pitch_der = derivative(t, pitch)

    # Average everything in order to match the frequency
    energy = average(energy, 10)
    energy_der = average(energy_der, 10)
    pitch = average(pitch, 10)
    pitch_der = average(pitch_der, 10)

    # Cut them to the same size
    # (the original listed len(pitch_der) twice and never checked len(pitch))
    min_size = min(len(energy), len(energy_der), len(pitch), len(pitch_der))
    energy = energy[:min_size]
    energy_der = energy_der[:min_size]
    pitch = pitch[:min_size]
    pitch_der = pitch_der[:min_size]

    # Stack them all together
    pros_feature = np.stack((energy, energy_der, pitch, pitch_der))  # , pitch_ind))

    # And reshape
    pros_feature = np.transpose(pros_feature)
    return pros_feature


def compute_prosody(audio_filename, time_step=0.05):
    print(pm.__file__)  # debug: shows which parselmouth installation is in use
    audio = pm.Sound(audio_filename)

    # Extract pitch and intensity
    pitch = audio.to_pitch(time_step=time_step)
    intensity = audio.to_intensity(time_step=time_step)

    # Evenly spaced time steps
    times = np.arange(0, audio.get_total_duration() - time_step, time_step)

    # Compute prosodic features at each time step
    pitch_values = np.nan_to_num(
        np.asarray([pitch.get_value_at_time(t) for t in times]))
    intensity_values = np.nan_to_num(
        np.asarray([intensity.get_value(t) for t in times]))
    intensity_values = np.clip(
        intensity_values, np.finfo(intensity_values.dtype).eps, None)

    # Normalize features [Chiu '11]
    pitch_norm = np.clip(np.log(pitch_values + 1) - 4, 0, None)
    intensity_norm = np.clip(np.log(intensity_values) - 3, 0, None)
    return pitch_norm, intensity_norm


def read_wav(path_wav):
    import wave
    f = wave.open(path_wav, 'rb')
    params = f.getparams()
    # number of channels, sample width in bytes, sample rate, number of frames
    nchannels, sampwidth, framerate, nframes = params[:4]
    voiceStrData = f.readframes(nframes)
    # Convert the raw bytes to 16-bit integers (frombuffer instead of .fromstring)
    waveData = np.frombuffer(voiceStrData, dtype=np.short)
    # Normalize the audio; cast to float first so that abs(-32768) cannot overflow
    waveData = waveData.astype(np.float64)
    waveData = waveData / np.max(np.abs(waveData))
    # Reshape and transpose so that each row holds the samples of one channel
    # (nchannels rows in total)
    waveData = np.reshape(waveData, [nframes, nchannels]).T
    f.close()
    return waveData, nframes, framerate


import matplotlib.pyplot as plt


def draw_time_domain_image(x1, waveData, nframes, framerate):  # time-domain features
    time = np.arange(0, nframes) * (1.0 / framerate)
    # plt.figure(1)
    x1.plot(time, waveData[0, :], c='b')
    x1.set_xlabel('time')
    x1.set_ylabel('am')
    # plt.show()


def draw_frequency_domain_image(x2, waveData):  # frequency-domain features
    fftdata = np.fft.fft(waveData[0, :])
    fftdata = abs(fftdata)
    hz_axis = np.arange(0, len(fftdata))
    # plt.figure(2)
    x2.plot(hz_axis, fftdata, c='b')
    x2.set_xlabel('hz')
    x2.set_ylabel('am')
    # plt.show()


def draw_Spectrogram(x3, waveData, framerate):  # spectrogram
    framelength = 0.025  # frame length, typically 20-30 ms
    # Samples per frame, N = t * fs. Usually 256 or 512; it has to equal NFFT,
    # and NFFT is best chosen as a power of two, so framesize should be one too.
    framesize = framelength * framerate
    nfftdict = {}
    lists = [32, 64, 128, 256, 512, 1024]
    for i in lists:  # distance of each candidate power of two from framesize
        nfftdict[i] = abs(framesize - i)
    # Sort ascending by distance from the current framesize
    sortlist = sorted(nfftdict.items(), key=lambda x: x[1])
    # Use the closest power of two as the new framesize
    framesize = int(sortlist[0][0])
    # NFFT must equal the number of time-domain samples per frame,
    # i.e. an FFT without zero padding
    NFFT = framesize
    # Overlap between frames, about 1/3 to 1/2 of the frame size
    overlapSize = 1.0 / 3 * framesize
    overlapSize = int(round(overlapSize))  # round to an integer
    # Draw the spectrogram
    spectrum, freqs, ts, fig = x3.specgram(
        waveData[0], NFFT=NFFT, Fs=framerate, window=np.hanning(M=framesize),
        noverlap=overlapSize, mode='default', scale_by_freq=True,
        sides='default', scale='dB', xextent=None)
    x3.set_ylabel('Frequency')
    x3.set_xlabel('Time(s)')
    x3.set_title('Spectrogram')
    # plt.show()


def mfcc_librosa(ax, path):
    y, sr = librosa.load(path, sr=None)
    # librosa.feature.mfcc(y=None, sr=22050, S=None, n_mfcc=20, dct_type=2,
    #                      norm='ortho', **kwargs)
    #   y: time-domain audio signal
    #   sr: sampling rate (default 22050)
    #   S: log-power Mel spectrogram (default None)
    #   n_mfcc: number of MFCCs to return (default 20)
    #   dct_type: type of the discrete cosine transform (default type 2)
    #   norm: for DCT type 2 or 3, 'ortho' selects an orthonormal DCT basis;
    #         normalization is not supported for DCT type 1
    #   kwargs: for time-series input, see melspectrogram
    #   returns M: the MFCC sequence
    mfcc_data = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    ax.matshow(mfcc_data)
    ax.set_title('MFCC')
    # plt.show()


from scipy.io import wavfile
from python_speech_features import mfcc, logfbank


def mfcc_python_speech_features(ax, path):
    sampling_freq, audio = wavfile.read(path)  # read the input audio file
    # Extract the MFCC and filter bank features
    mfcc_features = mfcc(audio, sampling_freq)
    filterbank_features = logfbank(audio, sampling_freq)  # numpy.ndarray, (n_frames, 26)
    print(filterbank_features.shape)

    print('\nMFCC:\nNumber of windows =', mfcc_features.shape[0])
    print('Length of each feature =', mfcc_features.shape[1])
    print('\nFilter bank:\nNumber of windows =', filterbank_features.shape[0])
    print('Length of each feature =', filterbank_features.shape[1])

    # Visualize the MFCCs; transpose so that time runs horizontally
    mfcc_features = mfcc_features.T
    ax.matshow(mfcc_features)
    plt.title('MFCC')

    # Visualize the filter bank features; transpose so that time runs horizontally.
    # Note: this is drawn on the same axes and therefore covers the MFCC image.
    filterbank_features = filterbank_features.T
    ax.matshow(filterbank_features)
    plt.title('Filter bank')
    plt.show()


if __name__ == "__main__":
    Debug = 1
    if Debug:
        audio_filename = "your path.wav"
        # feature = calculate_spectrogram(audio_filename)
        waveData, nframes, framerate = read_wav(audio_filename)

        ax1 = plt.subplot(3, 3, 1)
        draw_time_domain_image(ax1, waveData, nframes, framerate)
        ax2 = plt.subplot(3, 3, 2)
        draw_frequency_domain_image(ax2, waveData)
        ax3 = plt.subplot(3, 3, 3)
        draw_Spectrogram(ax3, waveData, framerate)
        ax4 = plt.subplot(3, 3, 4)
        mfcc_librosa(ax4, audio_filename)

        x = calculate_spectrogram(audio_filename)
        print(x.shape)  # (145, 64)
        ax5 = plt.subplot(3, 3, 5)
        ax5.plot(x)

        x = calculate_mfcc(audio_filename)
        print(x.shape)  # (143, 26)
        ax6 = plt.subplot(3, 3, 6)
        ax6.plot(x)

        x = extract_prosodic_features(audio_filename)
        print(x.shape)  # (143, 4)
        ax7 = plt.subplot(3, 3, 7)
        ax7.plot(x)

        x, y = compute_prosody(audio_filename, time_step=0.05)
        print(x.shape)  # (143,)
        print(y.shape)  # (143,)
        ax8 = plt.subplot(3, 3, 8)
        ax8.plot(x)
        ax9 = plt.subplot(3, 3, 9)
        ax9.plot(y)

        plt.tight_layout()
        plt.savefig('1.jpg')
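A note on the frame rates in the script: `extract_prosodic_features` samples pitch and energy every 5 ms and then averages groups of 10 values, which gives one value per 50 ms. The MFCCs use a 10 ms step averaged in groups of 5, and the spectrogram uses a hop of half of the 0.1 s window, so all three feature streams end up at roughly 20 frames per second. The sketch below walks through the differencing and averaging steps on a synthetic ramp. The functions repeat the logic of `derivative` and `average` from the script; the ramp signal and the names `step` and `coarse` are made up for illustration.

```python
import numpy as np


def derivative(x, f):
    """Backward finite difference of f with respect to x (x given in seconds)."""
    x = 1000 * x                        # seconds -> milliseconds, as in the script
    dx = x[1] - x[0]
    cf = np.convolve(f, [1, -1]) / dx   # cf[i] = (f[i] - f[i-1]) / dx for i >= 1
    der = cf[:-1].copy()                # drop the extra trailing sample
    der[0] = 0                          # the first sample has no predecessor
    return der


def average(arr, n):
    """Replace every n consecutive values by their mean, dropping any remainder."""
    end = n * (len(arr) // n)
    return np.mean(arr[:end].reshape(-1, n), axis=1)


# Synthetic one-second "energy" track on a 5 ms grid (200 frames).
step = 0.005
t = np.arange(200) * step
energy = 2.0 * t                        # ramp rising by 2 units per second

energy_der = derivative(t, energy)      # slope expressed per millisecond
coarse = average(energy, 10)            # 200 frames of 5 ms -> 20 frames of 50 ms

print(len(t), len(energy_der), len(coarse))
print(energy_der[:3], coarse[:3])
```

Because the derivative is taken before the averaging, the slope is estimated on the fine 5 ms grid and only then smoothed down to the 50 ms rate.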