A Collected List of Free Chinese Speech Datasets

As of January 11, 2019

AISHELL-1
AISHELL-2 (free upon application for universities and research institutions)
THCHS30
ST-CMDS
Primewords Chinese Corpus Set 1

As of July 7, 2020

Speechocean 10 Hours Chinese Mandarin Speech Recognition Corpus
MobvoiHotwords
CN-Celeb
MAGICDATA

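Download: all of these are available on OpenSLR except AISHELL-2, which can be obtained by applying to, or purchasing from, the AISHELL company (希尔贝壳).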

Dataset	Duration (hours)	Description
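AISHELL-1	178	The AISHELL-ASR0009 recording scripts cover 11 domains, including smart home, autonomous driving, and industrial production. Recording took place in a quiet indoor environment using three devices simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android phone (16 kHz, 16-bit), and an iOS phone (16 kHz, 16-bit). The high-fidelity microphone audio was downsampled to 16 kHz to produce AISHELL-ASR0009-OS1. 400 speakers from different accent regions of China took part in the recording. After transcription and annotation by professional proofreaders and strict quality inspection, the text accuracy of the corpus is above 95%. The data is divided into training, development, and test sets. (Academic research is supported; commercial use is prohibited without permission.)
AISHELL-2	1000	The AISHELL-2 Chinese Mandarin speech corpus contains 1000 hours of speech, of which 718 hours come from AISHELL-ASR0009-[ZH-CN] and 282 hours come from AISHELL-ASR0010-[ZH-CN]. The recording scripts cover 12 domains, including wake-up words, voice control commands, smart home, autonomous driving, and industrial production. Recording took place in a quiet indoor environment using three devices simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android phone (16 kHz, 16-bit), and an iOS phone (16 kHz, 16-bit). AISHELL-2 uses the speech data recorded on the iOS phones. 1991 speakers from different accent regions of China took part in the recording. After transcription and annotation by professional proofreaders and strict quality inspection, the text accuracy of the corpus is above 96%. (Academic research is supported; commercial use is prohibited without permission.)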
AISHELL-EVAL (AISHELL2-2018A-EVAL)		Test data: 5000 utterances from 10 speakers. Dev data: 2500 utterances from 5 speakers. Sampling rate: 16 kHz. Sample format: 16-bit. Environment: indoor. Speech data type: PCM. Channel number: 1. Recording equipment: iOS / Android / high-fidelity microphone.
THCHS30	30	THCHS30 is an open Chinese speech database published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.
ST-CMDS	500	A free Chinese Mandarin corpus by Surfingtech (www.surfing.ai), containing 102,600 utterances from 855 speakers. The corpus was recorded in a quiet indoor environment using cellphones. Each speaker has 120 utterances. All utterances were carefully transcribed and checked by humans. Transcription accuracy is guaranteed.
Primewords Chinese Corpus Set 1	100	This free Chinese Mandarin speech corpus is released by Shanghai Primewords Information Technology Co., Ltd. The corpus was recorded on smartphones by 296 native Chinese speakers. The transcription accuracy is higher than 98%, at a confidence level of 95%. It is free for academic use. The mapping between transcripts and utterances is given in JSON format.
Speechocean 10 Hours Chinese Mandarin Speech Recognition Corpus	10.33	This Chinese Mandarin speech recognition corpus is provided by Speechocean. It is a 10.33-hour corpus collected over 4 different microphones simultaneously. The corpus was recorded by 20 speakers (10 male and 10 female) in a quiet office. Each speaker recorded around 120 utterances in one channel. Transcription files are included. The sentence transcription accuracy is higher than 98%. It is free to use for academic purposes. This corpus is a subset of a larger corpus (159 hours), available from Speechocean on request.
MobvoiHotwords		MobvoiHotwords is a corpus of wake-up words collected from a commercial smart speaker made by Mobvoi. It consists of keyword and non-keyword utterances. For the keyword data, utterances containing either 'Hi xiaowen' or 'Nihao Wenwen' were collected. For each keyword, there are about 36k utterances. All keyword data was collected from 788 subjects, aged 3-65, at different distances from the smart speaker (1, 3 and 5 meters). Different noises (typical home environment noises such as music and TV) with varying sound pressure levels were played in the background during collection.
CN-Celeb		A large-scale speaker recognition dataset collected 'in the wild'. The dataset contains more than 130,000 utterances from 1,000 Chinese celebrities and covers 11 different genres in the real world. All audio files are coded as single channel and sampled at 16 kHz with 16-bit precision.
MAGICDATA	755	The corpus contains 755 hours of speech data, most of it recorded on mobile devices. 1080 speakers from different accent areas in China were invited to participate in the recording. The sentence transcription accuracy is higher than 98%. Recordings were conducted in a quiet indoor environment. The database is divided into training, validation, and test sets in a ratio of 51:1:2. Detailed information such as speech data coding and speaker information is preserved in the metadata file. The recording texts cover diverse domains, including interactive Q&A, music search, SNS messages, home command and control, etc. Segmented transcripts are also provided.
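
Several figures in the table can be checked with simple arithmetic. The sketch below is an illustration only: the durations are copied from the table, while the helper names `split_hours` and `resample_ratio` are hypothetical and not part of any dataset's tooling. It computes a naive total of the listed hours (this ignores any overlap between corpora, for example between AISHELL-1 and the AISHELL-ASR0009 portion of AISHELL-2), the approximate size of each MAGICDATA subset implied by the 51:1:2 ratio (assuming the ratio applies to hours), and the rational factor for downsampling 44.1 kHz audio to 16 kHz.

```python
from fractions import Fraction

# Durations in hours, copied from the table above.
# Datasets with no listed duration are omitted.
DURATION_HOURS = {
    "AISHELL-1": 178,
    "AISHELL-2": 1000,
    "THCHS30": 30,
    "ST-CMDS": 500,
    "Primewords Chinese Corpus Set 1": 100,
    "Speechocean 10 Hours": 10.33,
    "MAGICDATA": 755,
}


def split_hours(total_hours, ratio):
    """Divide total_hours among subsets in proportion to ratio (hypothetical helper)."""
    parts = sum(ratio)
    return [total_hours * r / parts for r in ratio]


def resample_ratio(source_hz, target_hz):
    """Return the reduced up/down factor for rational resampling (hypothetical helper)."""
    return Fraction(target_hz, source_hz)


# Naive sum of listed hours; overlap between corpora is not accounted for.
total = sum(DURATION_HOURS.values())

# MAGICDATA train / validation / test hours, assuming 51:1:2 applies to duration.
train, dev, test = split_hours(DURATION_HOURS["MAGICDATA"], (51, 1, 2))

# Factor for the 44.1 kHz -> 16 kHz downsampling mentioned for AISHELL.
ratio = resample_ratio(44100, 16000)

# ST-CMDS: 855 speakers with 120 utterances each.
st_cmds_utterances = 855 * 120

print(f"listed total: {total:.2f} hours")
print(f"MAGICDATA train/dev/test: {train:.1f} / {dev:.1f} / {test:.1f} hours")
print(f"44.1 kHz -> 16 kHz: up {ratio.numerator}, down {ratio.denominator}")
print(f"ST-CMDS utterances: {st_cmds_utterances}")
```

The reduced factor (up 160, down 441) is the pair of integers a polyphase resampler such as `scipy.signal.resample_poly` expects. The product 855 × 120 matches the 102,600 utterances stated for ST-CMDS.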
