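A Roundup of the Best Corpora, Chinese and International

Corpora are essential to both linguistic research and deep learning. This post summarizes commonly used corpus resources. Some of the information comes from other blogs, but this post will be kept up to date.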

Open Speech and Language Resources
http://www.openslr.org/resources.php

Update 2020/06/10

Several open-source speech datasets: https://blog.ailemon.me/2018/11/21/free-open-source-chinese-speech-datasets/

Update 2020/10/23

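AISHELL-3: a high-fidelity Mandarin speech corpus. The AISHELL-3 Mandarin speech corpus from AISHELL (希尔贝壳) contains 85 hours of speech in 88,035 utterances and can be used for multi-speaker synthesis systems. Recording took place in a quiet indoor environment with high-fidelity microphones (44.1 kHz, 16-bit). 218 speakers from different accent regions of China took part. Professional annotators labeled pinyin and prosody, and after strict quality inspection the character-level transcription accuracy is above 98%. (Academic research is supported; commercial use without permission is prohibited.)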
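The figures quoted for AISHELL-3 are enough for a rough sizing estimate before downloading. The sketch below derives the average utterance length, the average amount of speech per speaker, and the raw PCM size. The mono channel count is an assumption (the description does not state it), and the released files may be packaged differently, so treat the size as an order-of-magnitude figure only.

```python
# Figures quoted in the AISHELL-3 description above.
TOTAL_HOURS = 85
NUM_UTTERANCES = 88035
NUM_SPEAKERS = 218
SAMPLE_RATE_HZ = 44100
BITS_PER_SAMPLE = 16
CHANNELS = 1  # assumption: mono recordings (not stated in the description)

total_seconds = TOTAL_HOURS * 3600

# Average length of one utterance, in seconds.
avg_utterance_seconds = total_seconds / NUM_UTTERANCES

# Average number of utterances and minutes of speech per speaker.
avg_utterances_per_speaker = NUM_UTTERANCES / NUM_SPEAKERS
avg_minutes_per_speaker = total_seconds / 60 / NUM_SPEAKERS

# Size of the audio as uncompressed PCM (no file headers, no compression).
raw_pcm_bytes = total_seconds * SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS

print(f"average utterance length: {avg_utterance_seconds:.2f} s")
print(f"utterances per speaker:   {avg_utterances_per_speaker:.1f}")
print(f"minutes per speaker:      {avg_minutes_per_speaker:.1f}")
print(f"raw PCM size:             {raw_pcm_bytes / 1e9:.1f} GB")
```

These are averages only; the description does not say how evenly the utterances are spread across the 218 speakers.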
DiDiSpeech: A Large Scale Mandarin Speech Corpus
It consists of about 800 hours of speech data at 48kHz sampling rate from 6000 speakers and the corresponding texts. All speech data in the corpus was recorded in a quiet environment and is suitable for various speech processing tasks, such as voice conversion, multi-speaker text-to-speech and automatic speech recognition.

NHSS: A Speech and Singing Parallel Database
We present a database of parallel recordings of speech and singing, collected and released by the Human Language Technology (HLT) laboratory at the National University of Singapore (NUS), that is called the NUS-HLT Speak-Sing (NHSS) database. This database consists of recordings of sung vocals of English pop songs, the spoken counterpart of lyrics of the songs read by the singers in their natural reading manner, and manually prepared utterance-level and word-level annotations. The audio recordings in the NHSS database correspond to a total of 100 songs sung and spoken by 10 singers, resulting in a total of 7 hours of audio data. There are 5 male and 5 female singers, singing and reading the lyrics of 10 songs each. We release this database to the public for research activities.

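Update 2020/12/25
http://www.openslr.org/82/
CN-Celeb, a multi-scenario speaker recognition dataset, contains speech segments from 3,000 Chinese celebrities in scenarios such as interviews, song and dance performances, music, and film and television. CN-Celeb2 was collected with a procedure similar to that of CN-Celeb1: all speech segments were extracted from the data sources by an automated pipeline and then verified manually. The CN-Celeb series as a whole covers complexity in noise, channel, speaking style, and other respects, which makes it particularly suitable for research on speaker recognition in complex scenarios.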

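Update 2021/02/10
Dataset name: speechocean762

Download link: http://www.openslr.org/101/
Corresponding Kaldi recipe: egs/gop_speechocean762
Description: Xiaomi's speech team and SpeechOcean (海天瑞声) jointly open-sourced what is described as the industry's first reasonably complete public dataset for English pronunciation assessment.
Language: English spoken by Chinese speakers. The samples are balanced and the content is comprehensive: the dataset contains 5,000 English sentences covering many aspects of daily life, recorded by 250 non-native English speakers whose mother tongue is Mandarin. Speakers are balanced by gender and age, with a 1:1 male-to-female ratio and a 1:1 ratio of children to adults. Speakers' English proficiency was carefully designed and screened, with a good:medium:poor ratio of 2:1:1, so that feedback can be tested for learners at different proficiency levels.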

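Data Baker (标贝) open-source dataset:

https://www.data-baker.com/#/data/index/source
Effective duration: about 12 hours
Average length: 16 characters
Language: Standard Mandarin
Speaker: female; 20-30 years old; a positive, intellectual-sounding voice
Recording environment: a professional recording studio. 1) The studio meets professional voice-corpus recording standards; 2) the recording environment and equipment remained unchanged throughout; 3) the signal-to-noise ratio of the recording environment is no lower than 35 dB.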

cmudict

http://www.speech.cs.cmu.edu/cgi-bin/cmudict
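For readers who want to use cmudict as a pronunciation lexicon, here is a minimal parsing sketch. It assumes the plain-text dictionary layout: comment lines starting with `;;;`, one entry per line as a word followed by ARPAbet phones with stress digits, and a `(n)` suffix marking alternate pronunciations. The sample lines and the helper names `parse_cmudict` and `strip_stress` are illustrative, not part of any official tooling, and some distributions of the file use a slightly different layout or encoding.

```python
import re

# Illustrative sample in the assumed CMUdict plain-text layout.
SAMPLE = """\
;;; comment line
HELLO  HH AH0 L OW1
HELLO(1)  HH EH0 L OW1
WORLD  W ER1 L D
"""


def parse_cmudict(text):
    """Map each word to a list of pronunciations (each a list of phones)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        head, *phones = line.split()
        # "HELLO(1)" is an alternate pronunciation of "HELLO".
        word = re.sub(r"\(\d+\)$", "", head)
        entries.setdefault(word, []).append(phones)
    return entries


def strip_stress(phones):
    """Drop the trailing stress digit (0, 1, 2) from vowel phones."""
    return [p.rstrip("012") for p in phones]


lexicon = parse_cmudict(SAMPLE)
print(lexicon["HELLO"])
print(strip_stress(lexicon["WORLD"][0]))
```

Keeping every variant in a list, rather than overwriting, preserves alternate pronunciations for tasks such as forced alignment, where the choice of variant matters.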

Cantonese NLP:
https://github.com/CanCLID/awesome-cantonese-nlp

IPA：
https://en.wikipedia.org/wiki/Pinyin
https://github.com/untunt/PhonoCollection/blob/master/Standard%20Chinese.md

Update 2021/06/22
An open-source Chinese-English bilingual, multi-speaker emotional voice conversion (VC) database: Emotional Voice Conversion: Theory, Databases and ESD (https://arxiv.org/abs/2105.14762)

Update 2021/09/01
RyanSpeech Corpus

RyanSpeech is a new speech corpus for research on automated text-to-speech (TTS) systems. Publicly available TTS corpora are often noisy, recorded with multiple speakers, or do not have quality male speech data. In order to meet the need for a high-quality

http://mohammadmahoor.com/ryanspeech/

EVC：
https://arxiv.org/pdf/2105.14762.pdf

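Update 2021/09/08
AISHELL-4
http://www.aishelltech.com/aishell_4
AISHELL-4 is an eight-channel Mandarin speech dataset of meeting scenarios, recorded with a microphone array. It contains 211 meetings with 4 to 8 participants each, about 120 hours in total. The dataset aims to advance research on multi-speaker processing in real application scenarios. AISHELL-4 captures important characteristics of real meetings, such as pauses, overlapping speech, speaker turn-taking, and noise. It also provides accurate transcripts and timestamp information, so researchers can work on individual tasks such as front-end processing, speech recognition, and speaker diarization, or optimize them jointly.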

Update 2021/10/14
WenetSpeech

A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
https://wenet-e2e.github.io/WenetSpeech/

Update 2022/02/15
Adding several English corpora
LibriTTS corpus
http://openslr.magicdatatech.com/60/
Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus

Common Voice
https://commonvoice.mozilla.org/zh-CN/datasets

Hi-Fi Multi-Speaker English TTS Dataset (Hi-Fi TTS)
http://www.openslr.org/109/
About this resource:
Hi-Fi Multi-Speaker English TTS Dataset (Hi-Fi TTS) is a multi-speaker English dataset for training text-to-speech models. The dataset is based on public audiobooks from LibriVox and texts from Project Gutenberg.
The Hi-Fi TTS dataset contains about 291.6 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz.

Free ST American English Corpus
http://www.openslr.org/45/

VCTK
CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)
https://datashare.ed.ac.uk/handle/10283/3443

M-AILABS

Most of the data is based on LibriVox and Project Gutenberg. The training data consists of nearly a thousand hours of audio and the text files in a prepared format.

https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/

Dataset directories:
http://openslr.magicdatatech.com/resources.php

https://github.com/coqui-ai/open-speech-corpora

Other:

International corpora ❀❀❀

BNC: British National Corpus: http://www.natcorp.ox.ac.uk/

BOE: the Bank of English (Collins): http://www.collinslanguage.com/wordbanks/

United Nations document database (800,000 parallel documents in six languages): http://documents.un.org/simple.asp

ANC: American National Corpus: http://www.anc.org/

LCMC: Lancaster Corpus of Mandarin Chinese: http://ota.oucs.ox.ac.uk/s/download.php?otaid=2474

OLAC: Open Language Archives Community: http://search.language-archives.org/index.html

COCA: Corpus of Contemporary American English: http://www.americancorpus.org/

COHA: Corpus of Historical American English: http://corpus.byu.edu/coha/

Sketch Engine multilingual corpora: www.sketchengine.co.uk

BASE: British Academic Spoken English Corpus: http://www2.warwick.ac.uk/fac/soc/celte/research/base/

Leeds: http://corpus.leeds.ac.uk/internet.html

JustTheWord： http://193.133.140.102/JustTheWord/index.html

Lextutor: http://www.lextutor.ca/

Web Concordancer: www.edict.com.hk

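Chinese corpora ❀❀❀
BCC Corpus: http://bcc.blcu.edu.cn/

Corpus: http://yulk.org/

Corpus Online: http://www.cncorpus.org/

Peking University Center for Chinese Linguistics (CCL): http://ccl.pku.edu.cn/corpus.asp

State Language Commission Modern Chinese Corpus: http://www.cncorpus.org/

BFSU Corpus Linguistics: http://www.bfsu-corpus.org/

Classical Chinese Corpus: http://www.cncorpus.org/login.aspx

Corpus Linguistics Online: http://ccl.pku.edu.cn/corpus.asp

People's Daily Annotated Corpus: http://www.icl.pku.edu.cn/icl_res/

Research and Development Center for International Chinese Education Technology, HSK Dynamic Composition Corpus: http://202.112.195.192:8060/hsk/login.asp

Institute of Linguistics, Beijing Spoken Language Corpus Query System (BJKY): http://www.blcu.edu.cn/yys/6_beijing/6_beijing_chaxun.asp

Balanced Corpus of Modern Chinese: http://www.sinica.edu.tw/SinicaCorpus/

Ancient Chinese Corpus: http://www.sinica.edu.tw/ftms-bin/ftmsw

Tagged Corpus of Early Mandarin Chinese: http://www.sinica.edu.tw/Early_Mandarin/

Treebank database: http://treebank.sinica.edu.tw/

Chinese-English Bilingual Ontological Wordnet: http://bow.sinica.edu.tw/

Souwen Jiezi (搜文解字): http://words.sinica.edu.tw/

Wenguo Xunbaoji (文国寻宝记): http://www.sinica.edu.tw/wen/

Three Hundred Tang Poems: http://cls.admin.yzu.edu.tw/300/

Electronic texts of Chinese classics (汉籍电子文献): http://www.sinica.edu.tw/~tdbproj/handy1/

Dream of the Red Chamber online teaching and research data center: http://cls.hs.yzu.edu.tw/HLM/home.htm

Communication University of China Text Corpus Retrieval System: http://ling.cuc.edu.cn/RawPub/

Harbin Institute of Technology Information Retrieval Lab, shared corpus resources: http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm

Research Centre on Linguistics and Language Information Sciences, Hong Kong Institute of Education, and its corpus lab: http://www.livac.org/index.php?lang=sc

Chinese Linguistic Data Consortium: http://www.chineseldc.org/

Brigham Young University corpora: http://view.byu.edu/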
